1. Linear module: first-order features (with no explicit feature crossing), corresponding to $l_z=(l_z^1,l_z^2, ..., l_z^{D_1})$ in the paper.
@@ -230,7 +230,7 @@ class ProductLayer(Layer):
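Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the categorical features is shown; the plotting code is also available on GitHub.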
-
+
## Exercises
diff --git a/docs/ch02/ch2.2/ch2.2.3/AFM.md b/docs/ch02/ch2.2/ch2.2.3/AFM.md
index 82164853d..1de9803ac 100644
--- a/docs/ch02/ch2.2/ch2.2.3/AFM.md
+++ b/docs/ch02/ch2.2/ch2.2.3/AFM.md
@@ -9,7 +9,7 @@ $$
## How AFM works
-
+
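The figure above shows the structure of AFM's feature-crossing part (the non-crossing part is the same as in FM and is not drawn). AFM has two core components: the Pair-wise Interaction Layer and Attention-based Pooling. The former takes the latent vectors of the non-zero input features and computes the element-wise product (Hadamard product: corresponding elements of two vectors are multiplied, and the result is still a vector) of every pair. If the input has m non-zero features, the Pair-wise Interaction Layer outputs $\frac{m(m-1)}{2}$ vectors. These crossed feature vectors are then fed into Attention-based Pooling, which first computes an adaptive weight for each feature combination (using the Attention Net) and then compresses the set of vectors into a single vector by weighted summation. Because the final output must be a scalar, that vector is mapped to a single value by another vector, which gives the attention-weighted second-order crossing output. (If this is not yet clear, read the introductions to the two core layers below first.)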
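
The two steps described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the repository's Keras implementation: the function name `afm_second_order`, the parameter names `W`, `b`, `h`, `p`, and the single-hidden-layer ReLU attention net are all hypothetical choices made for the example.

```python
import numpy as np

def afm_second_order(emb, W, b, h, p):
    """Attention-weighted second-order term for one sample (illustrative sketch).

    emb: (m, k) latent vectors of the m non-zero features
    W, b, h: hypothetical attention-net parameters with shapes (k, t), (t,), (t,)
    p: (k,) vector that maps the pooled vector to a scalar
    """
    m = emb.shape[0]
    # Pair-wise Interaction Layer: Hadamard product of every pair i < j
    rows, cols = np.triu_indices(m, k=1)
    pairs = emb[rows] * emb[cols]                  # shape (m(m-1)/2, k)

    # Attention net: one hidden ReLU layer, then a score per pair
    hidden = np.maximum(pairs @ W + b, 0.0)
    scores = hidden @ h
    scores = scores - scores.max()                 # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()

    # Attention-based Pooling: weighted sum, then projection to a scalar
    pooled = weights @ pairs                       # shape (k,)
    return float(pooled @ p), pairs, weights
```

With m = 4 non-zero features the layer produces 4 * 3 / 2 = 6 pair vectors, and the attention weights sum to 1 because they come from a softmax.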
@@ -109,13 +109,13 @@ def AFM(linear_feature_columns, dnn_feature_columns):
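We do not go through every component here; the GitHub code we provide is heavily commented and should be easy to follow. To make reading easier, we also drew an overall model architecture diagram that shows each component and the forward pass (the hand-drawn figure is rough for now; we plan to tidy up all of these figures later).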
-
+
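Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown; the plotting code is also available on GitHub.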
-
+
## Questions to consider
diff --git a/docs/ch02/ch2.2/ch2.2.3/DeepFM.md b/docs/ch02/ch2.2/ch2.2.3/DeepFM.md
index 93d532fab..c03efaf7a 100644
--- a/docs/ch02/ch2.2/ch2.2.3/DeepFM.md
+++ b/docs/ch02/ch2.2/ch2.2.3/DeepFM.md
@@ -7,17 +7,17 @@
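- **Limitations of DNN**
When a DNN is used for recommendation, the number of network parameters becomes very large: discrete features have to be one-hot encoded during feature processing, which makes the input dimension explode. Here is a figure borrowed from an AI conference: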
-
+
A parameter count this large is impractical. To address it, we can adopt the classic Field idea and convert the one-hot features into dense vectors:
-
+
Adding fully connected layers on top then yields high-order feature combinations, as shown below:
-
+
Low-order feature combinations are still missing, however, so FM is added to represent them.
@@ -25,7 +25,7 @@
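FM and DNN can be combined in two ways, in parallel or in series, and each approach has several representative models. FNN came before DeepFM; although it may be less influential than DeepFM, understanding the idea behind FNN helps a lot in understanding DeepFM's characteristics and advantages.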
-
+
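FNN uses a pre-trained FM module to obtain latent vectors and then feeds them into a DNN. Further experiments showed that adding a product layer between the embedding layer and hidden layer 1 (as in the figure above) improves performance, which led to PNN, where the product layer replaces the pre-trained FM layer.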
@@ -33,7 +33,7 @@ FNN是使用预训练好的FM模块,得到隐向量,然后把隐向量作为
- **Wide&Deep**
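FNN and PNN still share an obvious unresolved weakness: they learn few low-order feature combinations. This is mainly caused by chaining FM and DNN in series: although FM learns low-order combinations, the fully connected structure of the DNN keeps them from being well represented at the DNN output. Having identified the problem, switching from a serial to a parallel structure addresses it fairly well, and this is what Google's Wide&Deep model does (covered in an earlier chapter). Looking more closely at how Wide&Deep is built, though, even with the parallel structure the Wide module needs fairly careful feature engineering in practice. In other words, manual processing has a large effect on model performance (this can be verified in the Wide&Deep section).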
-
+
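As the figure above shows, the model still has a problem: **because low-order and high-order features are combined directly at the output units, the model can easily end up biased toward learning either low-order or high-order features instead of combining them well.**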
@@ -41,7 +41,7 @@ FNN和PNN模型仍然有一个比较明显的尚未解决的缺点:对于低
## Model structure and principle
-
+
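The Field and Embedding processing is the same as in the methods above (the green part of the figure). DeepFM replaces the Wide part with an FM layer (the blue part of the figure).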
@@ -58,12 +58,12 @@ $$
\hat{y}_{FM}(x) = w_0+\sum_{i=1}^N w_ix_i + \sum_{i=1}^N \sum_{j=i+1}^N v_i^T v_j x_ix_j
$$
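
As a worked example of the formula above, the sketch below evaluates $\hat{y}_{FM}(x)$ in two ways: a direct double loop over all pairs $i<j$, and the well-known rewriting $\sum_{i<j} v_i^T v_j x_i x_j = \frac{1}{2}\sum_f \left[(\sum_i v_{i,f}x_i)^2 - \sum_i v_{i,f}^2 x_i^2\right]$, which costs $O(Nk)$ instead of $O(N^2k)$. The function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Direct evaluation of the FM formula: O(N^2 k)."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]
    return y

def fm_predict_fast(x, w0, w, V):
    """Same value via the sum-of-squares identity: O(N k)."""
    xv = V * x[:, None]                        # row i is v_i * x_i
    square_of_sum = np.square(xv.sum(axis=0))  # (sum_i v_i x_i)^2, per factor
    sum_of_square = np.square(xv).sum(axis=0)  # sum_i (v_i x_i)^2, per factor
    return w0 + w @ x + 0.5 * (square_of_sum - sum_of_square).sum()
```

Both functions should agree up to floating-point error for any input, which is a quick way to check an FM layer implementation.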
-
+
### Deep
Deep architecture diagram
-
+
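The Deep module learns high-order feature combinations. In the figure above, the Dense Embeddings are fed into the hidden layers through full connections. The Dense Embeddings are what solve the parameter explosion problem of DNNs, and they are a common technique in recommendation models.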
@@ -130,13 +130,13 @@ def DeepFM(linear_feature_columns, dnn_feature_columns):
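We do not go through every component here; the GitHub code we provide is heavily commented and should be easy to follow. To make reading easier, we also drew an overall model architecture diagram that shows each component and the forward pass (the hand-drawn figure is rough for now; we plan to tidy up all of these figures later).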
-
+
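Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown; the plotting code is also available on GitHub.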
-
+
## Questions to consider
@@ -144,7 +144,7 @@ def DeepFM(linear_feature_columns, dnn_feature_columns):
2. In the figure below, what do you think the differently colored nodes in Sparse Feature represent?
-
+
diff --git a/docs/ch02/ch2.2/ch2.2.3/NFM.md b/docs/ch02/ch2.2/ch2.2.3/NFM.md
index 0ccd2caa3..e7945ff00 100644
--- a/docs/ch02/ch2.2/ch2.2.3/NFM.md
+++ b/docs/ch02/ch2.2/ch2.2.3/NFM.md
@@ -10,11 +10,11 @@ $$
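Comparing with FM, only the third term changes; the first two terms are unchanged. As noted earlier, one limitation of FM is that it only models second-order crossing and is a linear model. To break through this limitation, the formula itself has to change, so the authors' idea is to **replace the inner product of second-order latent vectors in FM with a more expressive function**.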
-
+
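A neural network is the natural choice for this more expressive function, since in theory a neural network can approximate arbitrarily complex functions. The authors therefore replaced $f(x)$ with a neural network. It is not a plain DNN: the lower layers still model feature crossing and the upper layers are a DNN. This is the final NFM network: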
-
+
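Readers who have seen PNN will notice that this structure is very similar to it, except that PNN's product_layer is replaced here by Bi-Interaction Pooling, which is the core structure of NFM. Note that the figure omits the first-order part and only visualizes $f(x)$. Below we analyze the network step by step, starting from the bottom.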
@@ -130,11 +130,11 @@ def NFM(linear_feature_columns, dnn_feature_columns):
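With the explanation above, the model should be easy to understand at a high level. We do not go through every component here; the GitHub code we provide is heavily commented and should be easy to follow. To make reading easier, we also drew an overall model architecture diagram that shows each component and the forward pass (the hand-drawn figure is rough for now; we plan to tidy up all of these figures later).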
-
+
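Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown; the plotting code is also available on GitHub.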
-
+
## Exercises
diff --git a/docs/ch02/ch2.2/ch2.2.3/WideNDeep.md b/docs/ch02/ch2.2/ch2.2.3/WideNDeep.md
index a42f0872f..504ef9cb4 100644
--- a/docs/ch02/ch2.2/ch2.2.3/WideNDeep.md
+++ b/docs/ch02/ch2.2/ch2.2.3/WideNDeep.md
@@ -12,7 +12,7 @@ Wide&Deep模型就是围绕记忆性和泛化性进行讨论的,模型能够
## Model structure and principle
-
+
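The structure of the wide&deep model itself is very simple and easy to follow for anyone with some background in machine learning and deep learning. Deciding which features to put in the Wide part and which in the Deep part for your own scenario, however, requires understanding what the authors intended when they designed each part of the model, so this is a prerequisite for using the model well.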
@@ -88,13 +88,13 @@ def WideNDeep(linear_feature_columns, dnn_feature_columns):
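We do not go through every component here; the GitHub code we provide is heavily commented and should be easy to follow. To make reading easier, we also drew an overall model architecture diagram that shows each component and the forward pass (the hand-drawn figure is rough for now; we plan to tidy up all of these figures later).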
-
+
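Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown; the plotting code is also available on GitHub.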
-
+
## Questions to consider
diff --git a/docs/ch02/ch2.2/ch2.2.4/DIEN.md b/docs/ch02/ch2.2/ch2.2.4/DIEN.md
index 37a21e713..9be1b388d 100644
--- a/docs/ch02/ch2.2/ch2.2.4/DIEN.md
+++ b/docs/ch02/ch2.2/ch2.2.4/DIEN.md
@@ -6,7 +6,7 @@ DIN模型考虑了用户兴趣,并且强调用户兴趣是多样的,该模
## How DIEN works
-
+
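The model input can be split into two parts. One is the user's behavior sequence (converted by the interest extractor layer and the interest evolving layer into an embedding related to the user's current interest). The other is every feature other than user behavior, such as Target id, Context Feature, and UserProfile Feature. These are all converted to embeddings and concatenated into one large embedding that serves as the non-behavior feature (there may also be some non-id features here, which can presumably be concatenated directly). The DNN input is the concatenation of the behavior-sequence embedding and this non-behavior feature embedding.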
@@ -23,13 +23,13 @@ DIN模型考虑了用户兴趣,并且强调用户兴趣是多样的,该模
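First, we need to be clear about which two quantities the auxiliary loss is computed between. It is computed between the user's interest representation at each time step (the sequence of hidden states output by the GRU) and the representation of the item the user actually clicked (the input embedding sequence). More precisely, it is the loss between the (t+1)-th item in the behavior sequence and the user's interest representation at step t. **(Why pair the interest at step t with the real click at step t+1? My understanding is that only once we know which item the user actually clicked at step t+1 can we better pin down the user's interest at step t.)**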
-
+
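Of course, the loss between a clicked item and the interest just before the click only covers positive samples. At step t there are also many other items the user did not click; these are the negative samples, usually obtained by sampling from the items the user did not click. The auxiliary loss therefore involves the user's interest at a given time step together with the positive and negative items for that step. The final loss function is shown below.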
-
+
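Here $h_t^i$ is the hidden state of user $i$ at step $t$ and can be seen as the user's interest vector at step $t$; $e_b^i$ and $\hat{e_b^i}$ are the positive and negative samples, respectively; and $e_b^i[t+1]$ is the embedding of the item that user $i$ clicked at step $t+1$.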
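
To make the pairing of step $t$ with step $t+1$ concrete, here is a minimal NumPy sketch of an auxiliary loss for a single user. It rests on assumptions that are not taken from the repository's code: the score is a sigmoid of the inner product between the hidden state and the item embedding, the loss is averaged over time steps, and the names `auxiliary_loss`, `h`, `pos`, and `neg` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def auxiliary_loss(h, pos, neg, eps=1e-8):
    """Auxiliary loss for one user (illustrative sketch).

    h:   (T, d) GRU hidden states, one interest vector per step
    pos: (T, d) embeddings of the items actually clicked at each step
    neg: (T, d) embeddings of sampled items that were not clicked
    """
    h_t = h[:-1]                                      # interests at steps 1..T-1
    # Interest at step t is scored against the items of step t+1
    p_pos = sigmoid(np.sum(h_t * pos[1:], axis=1))    # should be close to 1
    p_neg = sigmoid(np.sum(h_t * neg[1:], axis=1))    # should be close to 0
    return -np.mean(np.log(p_pos + eps) + np.log(1.0 - p_neg + eps))
```

The loss is small when each hidden state aligns with the next clicked item and is dissimilar to the sampled negative, and it grows when the roles of the two are swapped.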
@@ -56,7 +56,7 @@ $$
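A user's interests are diverse, but each interest has its own evolution. Even if interests drift, we only need to consider the evolution of the interest related to the target item (an ad or a product), so the diversity of the user's interests no longer has to be modeled. To extract only the information related to the target item, the authors use the same approach as DIN and compute the similarity between the user's historical interests and the target item. That is, the local activation unit introduced in DIN is reused here (the Attention module in the figure below).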
-
+
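Once we have the user's historical interest sequence and its relevance to the target item (the attention scores), we model the sequence again to capture how the user's interest evolves and obtain the final interest vector. At this point the data is a sequence plus an attention weight for each vector in it, and the question is how to use these attention weights in the sequence modeling. The authors propose three GRU variants that incorporate attention: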
diff --git a/docs/ch02/ch2.2/ch2.2.4/DIN.md b/docs/ch02/ch2.2/ch2.2.4/DIN.md
index f99a9c5fa..dc104de6f 100644
--- a/docs/ch02/ch2.2/ch2.2.4/DIN.md
+++ b/docs/ch02/ch2.2/ch2.2.4/DIN.md
@@ -158,13 +158,13 @@ def DIN(feature_columns, behavior_feature_list, behavior_seq_feature_list):
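We do not go through every component here; the GitHub code we provide is heavily commented and should be easy to follow. To make reading easier, we also drew an overall model architecture diagram that shows each component and the forward pass (the hand-drawn figure is rough for now; we plan to tidy up all of these figures later).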
-
+
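Below is a model structure diagram plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown; the plotting code is also available on GitHub.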
-
+
## Questions to consider
diff --git "a/docs/ch03/ch3.1/jupyter/\345\244\232\350\267\257\345\217\254\345\233\236.ipynb" "b/docs/ch03/ch3.1/jupyter/\345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
index 3a4bccd4e..08bc05222 100644
--- "a/docs/ch03/ch3.1/jupyter/\345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
+++ "b/docs/ch03/ch3.1/jupyter/\345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
@@ -1,2107 +1,2107 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
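-    "# Multi-channel recall\n",
-    "\n",
-    "A *multi-channel recall* strategy uses several different strategies, features, or simple models to each recall part of the candidate set, and then merges the candidates for the downstream ranking model. It is clearly a trade-off between computation speed and recall rate: each simple strategy recalls candidates quickly, while designing the strategies from different angles keeps the overall recall rate close to ideal so that ranking quality does not suffer. The figure below illustrates multi-channel recall. The strategies are independent of one another, so they can usually be run concurrently in multiple threads for efficiency.\n",
-    "\n",
-    "\n",
-    "\n",
-    "The figure above is only one example: many different strategies can be used to obtain the candidate items to rank for a user, and which ones to use depends heavily on the business. Each task has recall rules that reflect its real scenario. For news recommendation, the recall rules could be popular news, author-based recall, keyword recall, topic recall, collaborative-filtering recall, and so on.  \n",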
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:29.834662Z",
- "start_time": "2020-11-16T11:26:27.811511Z"
- }
- },
- "outputs": [],
- "source": [
- "import pandas as pd \n",
- "import numpy as np\n",
- "from tqdm import tqdm \n",
- "from collections import defaultdict \n",
-    "import os, math, warnings, pickle\n",
- "import faiss\n",
- "import collections\n",
- "import random\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "from sklearn.preprocessing import LabelEncoder\n",
- "from datetime import datetime\n",
-    "from deepctr.feature_column import SparseFeat, VarLenSparseFeat\n",
- "from tensorflow.python.keras import backend as K\n",
- "from tensorflow.python.keras.models import Model\n",
- "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
- "\n",
- "from deepmatch.models import *\n",
- "from deepmatch.utils import sampledsoftmaxloss\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:31.831215Z",
- "start_time": "2020-11-16T11:26:31.826939Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'\n",
-    "# Flag for recall evaluation; when False, the full data set is used for recall without evaluation\n",
- "metric_recall = False"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
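-    "## Loading the data\n",
-    "In a typical recommender-system competition, data loading falls into three modes, each with its own data set:\n",
-    "1. Debug mode: the goal is to build a simple baseline on the data and get it running, to make sure the baseline code works. Competition data sets are usually huge, and analyzing everything and building the baseline on the full data from the start wastes time and hardware. **So we usually draw a random sample from the large training set for debugging (train_click_log_sample)** and get a baseline running first.\n",
-    "2. Offline validation mode: the goal is to choose a suitable model and hyperparameters offline using the existing training data. **So here we only load the whole training set (train_click_log)** and split it again into a training set and a validation set. The training set is used to fit the model; the validation set is used to tune the model and its hyperparameters.\n",
-    "3. Online mode: after building a baseline in debug mode and choosing the model and hyperparameters in offline validation mode, this is where we actually predict on the given test set and submit online. **So the training data here is the full data set (train_click_log+test_click_log).**\n",
-    "\n",
-    "Below we define a separate loading function for each of these modes, so that data can be loaded conveniently in each mode later."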
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:34.476240Z",
- "start_time": "2020-11-16T11:26:34.467352Z"
- }
- },
- "outputs": [],
- "source": [
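-    "# debug mode: sample part of the training set to debug the code\n",
-    "def get_all_click_sample(data_path, sample_nums=10000):\n",
-    "    \"\"\"\n",
-    "        Sample part of the training set for debugging\n",
-    "        data_path: path where the raw data is stored\n",
-    "        sample_nums: number of users to sample (sampling by user because of machine memory limits)\n",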
- " \"\"\"\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " all_user_ids = all_click.user_id.unique()\n",
- "\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
- " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click\n",
- "\n",
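-    "# Read the click data, split into online and offline. To produce an online submission, the test-set clicks should be merged into the full data\n",
-    "# To validate a model or features offline, the training set alone is enough\n",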
- "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
- " if offline:\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " else:\n",
- " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "\n",
-    "        all_click = pd.concat([trn_click, tst_click])\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:35.168738Z",
- "start_time": "2020-11-16T11:26:35.163210Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Read the basic article attributes\n",
- "def get_item_info_df(data_path):\n",
- " item_info_df = pd.read_csv(data_path + 'articles.csv')\n",
- " \n",
-    "    # Rename article_id to click_article_id so it can be joined with click_article_id in the training set\n",
- " item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'})\n",
- " \n",
- " return item_info_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:36.152958Z",
- "start_time": "2020-11-16T11:26:36.146324Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Read the article embedding data\n",
- "def get_item_emb_dict(data_path):\n",
- " item_emb_df = pd.read_csv(data_path + 'articles_emb.csv')\n",
- " \n",
- " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
- " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols])\n",
-    "    # L2-normalize each embedding\n",
- " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
- "\n",
- " item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np))\n",
- " pickle.dump(item_emb_dict, open(save_path + 'item_content_emb.pkl', 'wb'))\n",
- " \n",
- " return item_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:37.333536Z",
- "start_time": "2020-11-16T11:26:37.329545Z"
- }
- },
- "outputs": [],
- "source": [
- "max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:42.163494Z",
- "start_time": "2020-11-16T11:26:38.018094Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Sampled data\n",
-    "# all_click_df = get_all_click_sample(data_path)\n",
-    "\n",
-    "# Full data (offline=False merges the training and test clicks)\n",
-    "all_click_df = get_all_click_df(offline=False)\n",
-    "\n",
-    "# Min-max normalize the timestamps; used to compute weights in the association rules\n",
-    "all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:44.343500Z",
- "start_time": "2020-11-16T11:26:44.113891Z"
- }
- },
- "outputs": [],
- "source": [
- "item_info_df = get_item_info_df(data_path)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:24.295343Z",
- "start_time": "2020-11-16T11:26:44.398007Z"
- }
- },
- "outputs": [],
- "source": [
- "item_emb_dict = get_item_emb_dict(data_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Utility functions"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### User-item-time mapping\n",
-    "Used by the association-rule-based user collaborative filtering"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:33.791656Z",
- "start_time": "2020-11-16T11:27:33.784305Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build each user's clicked-article sequence ordered by click time: {user1: [(item1, time1), (item2, time2)..]...}\n",
- "def get_user_item_time(click_df):\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " \n",
- " def make_item_time_pair(df):\n",
- " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
- " \n",
-    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'item_time_list'})\n",
- " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
- " \n",
- " return user_item_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Item-user-time mapping\n",
-    "Used by the association-rule-based item collaborative filtering"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:38.327581Z",
- "start_time": "2020-11-16T11:27:38.321059Z"
- }
- },
- "outputs": [],
- "source": [
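-    "# Build, for each item, the sequence of users who clicked it ordered by time: {item1: [(user1, time1), (user2, time2)...]...}\n",
-    "# The time here is when the user clicked the item; it does not seem directly relevant.\n",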
- "def get_item_user_time_dict(click_df):\n",
- " def make_user_time_pair(df):\n",
- " return list(zip(df['user_id'], df['click_timestamp']))\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
-    "    item_user_time_df = click_df.groupby('click_article_id')[['user_id', 'click_timestamp']].apply(lambda x: make_user_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'user_time_list'})\n",
- " \n",
- " item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))\n",
- " return item_user_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
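-    "### History and last click\n",
-    "Used when evaluating recall results, in feature engineering, and when building labels to turn the task into a supervised learning test set"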
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:50.894683Z",
- "start_time": "2020-11-16T11:27:50.888002Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Split the data into each user's historical clicks and last click\n",
- "def get_hist_and_last_click(all_click):\n",
- " \n",
- " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
- " click_last_df = all_click.groupby('user_id').tail(1)\n",
- "\n",
-    "    # If a user has only one click, hist would be empty and the user would be unseen during training, so we accept leaking that click by default\n",
- " def hist_func(user_df):\n",
- " if len(user_df) == 1:\n",
- " return user_df\n",
- " else:\n",
- " return user_df[:-1]\n",
- "\n",
- " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
- "\n",
- " return click_hist_df, click_last_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Article attribute features"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:55.893810Z",
- "start_time": "2020-11-16T11:27:55.887623Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Map article ids to their basic attributes and store them as dicts, for direct use in the recall and cold-start stages\n",
- "def get_item_info_dict(item_info_df):\n",
- " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
- " item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler)\n",
- " \n",
- " item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id']))\n",
- " item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count']))\n",
- " item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts']))\n",
- " \n",
- " return item_type_dict, item_words_dict, item_created_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T06:42:38.730939Z",
- "start_time": "2020-11-13T06:42:38.728461Z"
- }
- },
- "source": [
-    "### Article information from each user's click history"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:59.650781Z",
- "start_time": "2020-11-16T11:27:59.640572Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_hist_item_info_dict(all_click):\n",
- " \n",
-    "    # Dict: user_id -> set of categories of the articles the user has clicked\n",
- " user_hist_item_typs = all_click.groupby('user_id')['category_id'].agg(set).reset_index()\n",
- " user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id']))\n",
- " \n",
-    "    # Dict: user_id -> set of articles the user has clicked\n",
- " user_hist_item_ids_dict = all_click.groupby('user_id')['click_article_id'].agg(set).reset_index()\n",
- " user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id']))\n",
- " \n",
-    "    # Dict: user_id -> average word count of the articles the user has clicked\n",
- " user_hist_item_words = all_click.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
- " user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count']))\n",
- " \n",
-    "    # Dict: user_id -> creation time of the last article the user clicked\n",
- " all_click_ = all_click.sort_values('click_timestamp')\n",
- " user_last_item_created_time = all_click_.groupby('user_id')['created_at_ts'].apply(lambda x: x.iloc[-1]).reset_index()\n",
- " \n",
- " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
- " user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler)\n",
- " \n",
- " user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'], \\\n",
- " user_last_item_created_time['created_at_ts']))\n",
- " \n",
- " return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Top-k most-clicked articles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:04.761105Z",
- "start_time": "2020-11-16T11:28:04.756419Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Get the k most-clicked articles\n",
- "def get_item_topk_click(click_df, k):\n",
- " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
- " return topk_click"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Defining the multi-channel recall dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:08.321506Z",
- "start_time": "2020-11-16T11:28:07.623281Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Get the article attributes and store them as dicts for easy lookup\n",
- "item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:13.791569Z",
- "start_time": "2020-11-16T11:28:13.786522Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Define a multi-channel recall dict that stores the results of every recall channel\n",
- "user_multi_recall_dict = {'itemcf_sim_itemcf_recall': {},\n",
- " 'embedding_sim_item_recall': {},\n",
- " 'youtubednn_recall': {},\n",
- " 'youtubednn_usercf_recall': {}, \n",
- " 'cold_start_recall': {}}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T05:41:12.710754Z",
- "start_time": "2020-11-16T05:40:57.842614Z"
- }
- },
- "outputs": [],
- "source": [
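-    "# Hold out the last click for recall evaluation; if no recall evaluation is needed, use the full training set for recall (offline model validation)\n",
-    "# Without recall evaluation, recall directly on the full data and do not hold out the last click\n",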
- "trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
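-    "## Recall evaluation function\n",
-    "After recall, the recall method or its parameters sometimes need tuning to get better results, because recall determines the upper bound of the final ranking. An evaluation function for recall is provided below"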
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T05:41:18.579118Z",
- "start_time": "2020-11-16T05:41:18.571887Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Evaluate the hit rate among the top 10, 20, 30, 40, 50 recalled articles in turn\n",
-    "def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=50):\n",
- " last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))\n",
- " user_num = len(user_recall_items_dict)\n",
- " \n",
- " for k in range(10, topk+1, 10):\n",
- " hit_num = 0\n",
- " for user, item_list in user_recall_items_dict.items():\n",
-    "            # Take the top-k recalled results\n",
- " tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]\n",
- " if last_click_item_dict[user] in set(tmp_recall_items):\n",
- " hit_num += 1\n",
- " \n",
- " hit_rate = round(hit_num * 1.0 / user_num, 5)\n",
- " print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
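-    "## Computing the similarity matrices\n",
-    "\n",
-    "This part obtains similarity matrices through collaborative filtering and vector retrieval. They are mainly of two kinds, user2user and item2item. We start with the itemcf-based item2item similarity matrix."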
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
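-    "### itemcf i2i_sim\n",
-    "\n",
-    "Borrowing from the debiased item recommendation task of KDD 2020, association rules are used when computing the item2item similarity matrix, so that the article similarity also takes into account:\n",
-    "1. a weight for the user's click time\n",
-    "2. a weight for the user's click order\n",
-    "3. a weight for the article creation time"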
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:30:51.872262Z",
- "start_time": "2020-11-16T11:30:51.860099Z"
- }
- },
- "outputs": [],
- "source": [
- "def itemcf_sim(df, item_created_time_dict):\n",
- " \"\"\"\n",
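-    "        Compute the article-to-article similarity matrix\n",
-    "        :param df: click data table\n",
-    "        :param item_created_time_dict: dict of article creation times\n",
-    "        return : article-to-article similarity matrix\n",
-    "        \n",
-    "        Idea: item-based collaborative filtering (see the previous team-learning course on recommender-system basics for details) + association rules\n",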
- " \"\"\"\n",
- " \n",
- " user_item_time_dict = get_user_item_time(df)\n",
- " \n",
-    "    # Compute item similarity\n",
- " i2i_sim = {}\n",
- " item_cnt = defaultdict(int)\n",
- " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
-    "        # Time can be taken into account when optimizing item-based collaborative filtering\n",
- " for loc1, (i, i_click_time) in enumerate(item_time_list):\n",
- " item_cnt[i] += 1\n",
- " i2i_sim.setdefault(i, {})\n",
- " for loc2, (j, j_click_time) in enumerate(item_time_list):\n",
- " if(i == j):\n",
- " continue\n",
- " \n",
-    "                # Distinguish clicks in forward order from clicks in reverse order\n",
-    "                loc_alpha = 1.0 if loc2 > loc1 else 0.7\n",
-    "                # Position weight; the parameters can be tuned\n",
-    "                loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))\n",
-    "                # Click-time weight; the parameters can be tuned\n",
-    "                click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))\n",
-    "                # Weight from the two articles' creation times; the parameters can be tuned\n",
-    "                created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
-    "                i2i_sim[i].setdefault(j, 0)\n",
-    "                # Combine the weights to accumulate the final similarity between the articles\n",
- " i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)\n",
- " \n",
- " i2i_sim_ = i2i_sim.copy()\n",
- " for i, related_items in i2i_sim.items():\n",
- " for j, wij in related_items.items():\n",
- " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
- " \n",
-    "    # Save the similarity matrix to disk\n",
- " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
- " \n",
- " return i2i_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:47:09.937002Z",
- "start_time": "2020-11-16T11:30:57.394334Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [14:20<00:00, 290.38it/s]\n"
- ]
- }
- ],
- "source": [
- "i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
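-    "### usercf u2u_sim\n",
-    "\n",
-    "Simple association rules can also be used when computing the similarity between users, for example a user-activity weight. Here a user's number of clicks is used as the activity measure"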
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:14.951940Z",
- "start_time": "2020-11-16T09:11:14.945654Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_activate_degree_dict(all_click_df):\n",
- " all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index()\n",
- " \n",
-    "    # Normalize user activity\n",
- " mm = MinMaxScaler()\n",
- " all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']])\n",
- " user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id']))\n",
- " \n",
- " return user_activate_degree_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:19.879276Z",
- "start_time": "2020-11-16T09:11:19.868808Z"
- }
- },
- "outputs": [],
- "source": [
- "def usercf_sim(all_click_df, user_activate_degree_dict):\n",
- " \"\"\"\n",
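-    "        Compute the user similarity matrix\n",
-    "        :param all_click_df: click data table\n",
-    "        :param user_activate_degree_dict: dict of user activity\n",
-    "        return: user similarity matrix\n",
-    "        \n",
-    "        Idea: user-based collaborative filtering (see the previous team-learning course on recommender-system basics for details) + association rules\n",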
- " \"\"\"\n",
- " item_user_time_dict = get_item_user_time_dict(all_click_df)\n",
- " \n",
- " u2u_sim = {}\n",
- " user_cnt = defaultdict(int)\n",
- " for item, user_time_list in tqdm(item_user_time_dict.items()):\n",
- " for u, click_time in user_time_list:\n",
- " user_cnt[u] += 1\n",
- " u2u_sim.setdefault(u, {})\n",
- " for v, click_time in user_time_list:\n",
-    "                if u == v:\n",
-    "                    continue\n",
-    "                u2u_sim[u].setdefault(v, 0)\n",
-    "                # Use the average activity of the two users as the activity weight; this formula can also be improved\n",
-    "                activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v])\n",
-    "                u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1)\n",
- " \n",
- " u2u_sim_ = u2u_sim.copy()\n",
- " for u, related_users in u2u_sim.items():\n",
- " for v, wij in related_users.items():\n",
- " u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v])\n",
- " \n",
-    "    # Save the similarity matrix to disk\n",
- " pickle.dump(u2u_sim_, open(save_path + 'usercf_u2u_sim.pkl', 'wb'))\n",
- "\n",
- " return u2u_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T06:59:46.701572Z",
- "start_time": "2020-11-16T06:59:26.852246Z"
- }
- },
- "outputs": [],
- "source": [
-    "# usercf uses too much memory on the full data, so it is not run here\n",
-    "# It can be run on sampled data\n",
- "user_activate_degree_dict = get_user_activate_degree_dict(all_click_df)\n",
- "u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
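-    "### item embedding sim\n",
-    "\n",
-    "Item similarity is computed from embeddings so that, during cold start later, articles that never appear in the click data can be retrieved. Cold start has its own section later; here is a brief introduction to faiss.\n",
-    "\n",
-    "Faiss is a library for clustering and similarity search open-sourced by Facebook's AI team, implemented in C++ underneath. Thanks to its excellent performance, Faiss is widely used in recommendation-related applications.\n",
-    "\n",
-    "Faiss is typically used in the vector recall part of a recommender system. Vector recall is u2u, u2i, or i2i, where u and i stand for user and item. In real scenarios the numbers of users and items are huge. The most obvious similarity-based recall is a double loop over the user or item list that computes the similarity of every pair of vectors, but that is impractical at this scale. Faiss speeds up finding the topk indexed vectors most similar to a query vector.\n",
-    "\n",
-    "**How faiss search works:**\n",
-    "\n",
-    "Faiss uses PCA and PQ (product quantization) to compress and encode vectors. It uses other optimizations as well, but PCA and PQ are the core ones. (The IndexFlatIP index used in the code below performs an exact inner-product search and applies neither.)\n",
-    "\n",
-    "1. For the details of PCA dimensionality reduction, see  \n",
-    "[主成分分析(PCA)原理总结](https://www.cnblogs.com/pinard/p/6239403.html)  \n",
-    "\n",
-    "2. For the details of PQ encoding, see  \n",
-    "[实例理解product quantization算法](http://www.fabwrite.com/productquantization)\n",
-    "\n",
-    "**Using faiss**\n",
-    "\n",
-    "[Official faiss tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)"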
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:28.631803Z",
- "start_time": "2020-11-16T09:11:28.619926Z"
- }
- },
- "outputs": [],
- "source": [
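-    "# Similarity computation by vector retrieval\n",
-    "# topk: the number of most similar items faiss returns for each item\n",
-    "def embdding_sim(click_df, item_emb_df, save_path, topk):\n",
-    "    \"\"\"\n",
-    "        Compute the content-based article embedding similarity matrix\n",
-    "        :param click_df: click data table\n",
-    "        :param item_emb_df: article embeddings\n",
-    "        :param save_path: path to save to\n",
-    "        :param topk: number of most similar articles to find\n",
-    "        return: article similarity matrix\n",
-    "        \n",
-    "        Idea: for each article, return the topk most similar articles by embedding similarity; faiss is used for speed because there are so many articles\n",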
- " \"\"\"\n",
- " \n",
-    "    # Map article row index to raw article id\n",
- " item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id']))\n",
- " \n",
- " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
- " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32)\n",
-    "    # Normalize the vectors to unit length\n",
- " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
- " \n",
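-    "    # Build the faiss index\n",
-    "    item_index = faiss.IndexFlatIP(item_emb_np.shape[1])\n",
-    "    item_index.add(item_emb_np)\n",
-    "    # Similarity search: for the vector at each index position, return the topk items and their similarities\n",
-    "    sim, idx = item_index.search(item_emb_np, topk) # both are numpy arrays of shape (n_items, topk)\n",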
- " \n",
-    "    # Convert the retrieval results back to raw article ids\n",
- " item_sim_dict = collections.defaultdict(dict)\n",
- " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)):\n",
- " target_raw_id = item_idx_2_rawid_dict[target_idx]\n",
-    "        # Start from 1 to drop the item itself, so only topk-1 similar items are kept\n",
- " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
- " rele_raw_id = item_idx_2_rawid_dict[rele_idx]\n",
- " item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
- " \n",
-    "    # Save the i2i similarity matrix\n",
- " pickle.dump(item_sim_dict, open(save_path + 'emb_i2i_sim.pkl', 'wb')) \n",
- " \n",
- " return item_sim_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:32:35.926116Z",
- "start_time": "2020-11-16T09:11:44.586967Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "364047it [00:23, 15292.14it/s]\n"
- ]
- }
- ],
- "source": [
- "item_emb_df = pd.read_csv(data_path + '/articles_emb.csv')\n",
- "emb_i2i_sim = embdding_sim(all_click_df, item_emb_df, save_path, topk=10) # topk can be set as needed"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
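- "## Recall\n",
- "This is the problem raised at the start: with about 360,000 articles and more than 200,000 users to recommend to, what strategies can reduce the scale of the problem? In the recall stage we select, for each user, a candidate set of articles the user may click, which shrinks the problem. Common recall strategies:\n",
- "* Youtube DNN recall\n",
- "* Article-based recall\n",
- "    * Article collaborative filtering\n",
- "    * Recall based on article embeddings\n",
- "* User-based recall\n",
- "    * User collaborative filtering\n",
- "    * User embeddings\n",
- "\n",
- "Some of the recall methods above start from the articles a user has already read and recall articles similar to them; different ways of computing that similarity give different recall methods, such as article collaborative filtering and article content embeddings. Others recommend by user similarity: a user is recommended articles read by other similar users, as in user collaborative filtering and user embeddings. A third approach resembles matrix factorization: once user and article embeddings are computed, user-article similarity can be computed directly and used for recommendation, as in YouTube DNN. Each recall method is described in detail below:"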
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### YoutubeDNN recall\n",
- "**(This step directly produces each user's list of recalled candidate articles)**\n",
- "\n",
- "[Paper download link](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)\n",
- "\n",
- "**YoutubeDNN recall architecture**\n",
- "\n",
- "![image-20201111160516562](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201111160516562.png)\n",
- "\n",
- "\n",
- "\n",
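- "For the principles and applications of YoutubeDNN, two blog posts by Wang Zhe are recommended:\n",
- "\n",
- "1. [Rereading the YouTube deep learning recommender system paper: every word a gem, a work of genius](https://zhuanlan.zhihu.com/p/52169807)\n",
- "2. [Ten engineering questions about the YouTube deep learning recommender system](https://zhuanlan.zhihu.com/p/52504407)\n",
- "\n",
- "\n",
- "**References:**\n",
- "1. https://zhuanlan.zhihu.com/p/52169807 (YouTubeDNN principles)\n",
- "2. https://zhuanlan.zhihu.com/p/26306795 (widely upvoted Zhihu article on Word2Vec) --- word2vec is covered in the w2v introduction of the ranking section\n"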
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:13:11.058766Z",
- "start_time": "2020-11-16T10:13:11.041084Z"
- }
- },
- "outputs": [],
- "source": [
- "# Build the training and validation data for two-tower recall\n",
- "# negsample: number of negative samples per positive sample when building samples with a sliding window\n",
- "def gen_data_set(data, negsample=0):\n",
- "    data.sort_values(\"click_timestamp\", inplace=True)\n",
- "    item_ids = data['click_article_id'].unique()\n",
- "\n",
- "    train_set = []\n",
- "    test_set = []\n",
- "    for reviewerID, hist in tqdm(data.groupby('user_id')):\n",
- "        pos_list = hist['click_article_id'].tolist()\n",
- "        \n",
- "        if negsample > 0:\n",
- "            candidate_set = list(set(item_ids) - set(pos_list))   # draw negatives from articles the user has not read\n",
- "            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  # negsample negatives for each positive sample\n",
- "            \n",
- "        # When a user has only one click, the record must still go into the training set; otherwise the learned embeddings would have gaps\n",
- "        if len(pos_list) == 1:\n",
- "            train_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
- "            test_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
- "            \n",
- "        # Build positive and negative samples with a sliding window\n",
- "        for i in range(1, len(pos_list)):\n",
- "            hist = pos_list[:i]\n",
- "            \n",
- "            if i != len(pos_list) - 1:\n",
- "                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  # positive sample [user_id, his_item, pos_item, label, len(his_item)]\n",
- "                for negi in range(negsample):\n",
- "                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0,len(hist[::-1]))) # negative sample [user_id, his_item, neg_item, label, len(his_item)]\n",
- "            else:\n",
- "                # Use the longest sequence (the one ending at the last click) as the test sample\n",
- "                test_set.append((reviewerID, hist[::-1], pos_list[i],1,len(hist[::-1])))\n",
- "                \n",
- "    random.shuffle(train_set)\n",
- "    random.shuffle(test_set)\n",
- "    \n",
- "    return train_set, test_set\n",
- "\n",
- "# Pad the input data so that all sequence features have the same length\n",
- "def gen_model_input(train_set,user_profile,seq_max_len):\n",
- "\n",
- "    train_uid = np.array([line[0] for line in train_set])\n",
- "    train_seq = [line[1] for line in train_set]\n",
- "    train_iid = np.array([line[2] for line in train_set])\n",
- "    train_label = np.array([line[3] for line in train_set])\n",
- "    train_hist_len = np.array([line[4] for line in train_set])\n",
- "\n",
- "    train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)\n",
- "    train_model_input = {\"user_id\": train_uid, \"click_article_id\": train_iid, \"hist_article_id\": train_seq_pad,\n",
- "                         \"hist_len\": train_hist_len}\n",
- "\n",
- "    return train_model_input, train_label"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:13:18.124452Z",
- "start_time": "2020-11-16T10:13:18.098284Z"
- }
- },
- "outputs": [],
- "source": [
- "def youtubednn_u2i_dict(data, topk=20):\n",
- "    sparse_features = [\"click_article_id\", \"user_id\"]\n",
- "    SEQ_LEN = 30 # length of the user click sequence: shorter ones are padded, longer ones truncated\n",
- "    \n",
- "    user_profile_ = data[[\"user_id\"]].drop_duplicates('user_id')\n",
- "    item_profile_ = data[[\"click_article_id\"]].drop_duplicates('click_article_id')\n",
- "    \n",
- "    # Label-encode the categorical features\n",
- "    features = [\"click_article_id\", \"user_id\"]\n",
- "    feature_max_idx = {}\n",
- "    \n",
- "    for feature in features:\n",
- "        lbe = LabelEncoder()\n",
- "        data[feature] = lbe.fit_transform(data[feature])\n",
- "        feature_max_idx[feature] = data[feature].max() + 1\n",
- "    \n",
- "    # Extract the user and item profiles; which features to include still needs further analysis\n",
- "    user_profile = data[[\"user_id\"]].drop_duplicates('user_id')\n",
- "    item_profile = data[[\"click_article_id\"]].drop_duplicates('click_article_id')\n",
- "    \n",
- "    user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))\n",
- "    item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))\n",
- "    \n",
- "    # Split into training and test sets\n",
- "    # Deep learning usually needs a large amount of data, so to keep recall quality the training samples are often expanded with a sliding window\n",
- "    train_set, test_set = gen_data_set(data, 0)\n",
- "    # Arrange the input data; see the functions above for details\n",
- "    train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)\n",
- "    test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)\n",
- "    \n",
- "    # Set the embedding dimension\n",
- "    embedding_dim = 16\n",
- "    \n",
- "    # Arrange the data into a form the model can take directly as input\n",
- "    user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),\n",
- "                            VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,\n",
- "                                                        embedding_name=\"click_article_id\"), SEQ_LEN, 'mean', 'hist_len'),]\n",
- "    item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]\n",
- "    \n",
- "    # Define the model\n",
- "    # num_sampled: number of samples drawn for negative sampling\n",
- "    model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))\n",
- "    # Compile the model\n",
- "    model.compile(optimizer=\"adam\", loss=sampledsoftmaxloss)\n",
- "    \n",
- "    # Train the model; validation_split sets the validation fraction, and 0 means training on the full data\n",
- "    history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)\n",
- "    \n",
- "    # After training, extract the learned embeddings for both the user side and the item side\n",
- "    test_user_model_input = test_model_input\n",
- "    all_item_model_input = {\"click_article_id\": item_profile['click_article_id'].values}\n",
- "\n",
- "    user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)\n",
- "    item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)\n",
- "    \n",
- "    # Save the current item_embedding and user_embedding, which may be useful in the ranking stage; they must be keyed by the raw ids when saved\n",
- "    user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)\n",
- "    item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)\n",
- "    \n",
- "    # Normalize the embeddings before saving\n",
- "    user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)\n",
- "    item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)\n",
- "    \n",
- "    # Convert the embeddings to dicts for easy lookup\n",
- "    # user_embs rows follow the shuffled test set order, so pair them with the user ids of the test input\n",
- "    raw_user_id_emb_dict = {user_index_2_rawid[k]: \\\n",
- "                                v for k, v in zip(test_user_model_input['user_id'], user_embs)}\n",
- "    raw_item_id_emb_dict = {item_index_2_rawid[k]: \\\n",
- "                                v for k, v in zip(item_profile['click_article_id'], item_embs)}\n",
- "    # Save the embeddings locally\n",
- "    pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))\n",
- "    pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))\n",
- "    \n",
- "    # faiss nearest neighbor search: use each user_embedding to find the topk most similar items\n",
- "    index = faiss.IndexFlatIP(embedding_dim)\n",
- "    # The embeddings were already normalized above, so normalization can be skipped here\n",
- "    # faiss.normalize_L2(user_embs)\n",
- "    # faiss.normalize_L2(item_embs)\n",
- "    index.add(item_embs) # build the index from the item vectors\n",
- "    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # query the topk most similar items for each user\n",
- "    \n",
- "    user_recall_items_dict = collections.defaultdict(dict)\n",
- "    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):\n",
- "        target_raw_id = user_index_2_rawid[target_idx]\n",
- "        # The query is a user and the results are items, so there is no self-match to skip: keep all topk results\n",
- "        for rele_idx, sim_value in zip(rele_idx_list, sim_value_list):\n",
- "            rele_raw_id = item_index_2_rawid[rele_idx]\n",
- "            user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\\\n",
- "                                                                    .get(rele_raw_id, 0) + sim_value\n",
- "            \n",
- "    # Sort each user's recall results by similarity, highest first\n",
- "    user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}\n",
- "    \n",
- "    # Save the recall results\n",
- "    # The recall results here come directly from vector search; the other recall methods only produce i2i and u2u similarity matrices, which still need a collaborative filtering step to yield recall results\n",
- "    # These results can be evaluated directly; for convenience, one evaluation function can be written to evaluate all recall results\n",
- "    pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl', 'wb'))\n",
- "    return user_recall_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:21:46.420014Z",
- "start_time": "2020-11-16T10:13:35.351131Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [02:02<00:00, 2038.57it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Deprecated in favor of operator or tf.math.divide.\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "1149673/1149673 [==============================] - 216s 188us/sample - loss: 0.1326\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "250000it [00:32, 7720.75it/s]\n"
- ]
- }
- ],
- "source": [
- "# Recall evaluation is needed here, so the last click of each user in the training set is held out\n",
- "if not metric_recall:\n",
- "    user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)\n",
- "else:\n",
- "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "    user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)\n",
- "    # Evaluate recall quality\n",
- "    metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
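- "### itemcf recall\n",
- "\n",
- "Article similarity matrices have already been obtained above through collaborative filtering and embedding retrieval. Below, the collaborative filtering idea is used to recall, for each user, articles similar to those in the user's history.\n",
- "Association rules are also applied during recall:\n",
- "1. Weight by the position of the historical click in the user's click sequence (see the code for details)\n",
- "2. Weight by article creation time, i.e. the difference between the creation times of the similar article and the historically clicked article\n",
- "3. Weight by article content similarity (computed from embeddings; note that the embedding step did not compute similarity between every pair of items, so a similar article may have no stored similarity with a historically clicked article and needs special handling)"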
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:48:40.580553Z",
- "start_time": "2020-11-16T11:48:40.567130Z"
- }
- },
- "outputs": [],
- "source": [
- "# Item-based recall (i2i)\n",
- "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
- "    \"\"\"\n",
- "        Recall based on article collaborative filtering\n",
- "        :param user_id: user id\n",
- "        :param user_item_time_dict: dict, each user's clicked articles ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
- "        :param i2i_sim: dict, article similarity matrix\n",
- "        :param sim_item_topk: int, number of articles most similar to the current article to consider\n",
- "        :param recall_item_num: int, number of articles in the final recall list\n",
- "        :param item_topk_click: list of the most clicked articles, used to pad the recall list\n",
- "        :param item_created_time_dict: dict, article creation times\n",
- "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
- "        \n",
- "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",
- "    \"\"\"\n",
- "    # Articles the user has interacted with\n",
- "    user_hist_items = user_item_time_dict[user_id]\n",
- "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
- "    \n",
- "    item_rank = {}\n",
- "    for loc, (i, click_time) in enumerate(user_hist_items):\n",
- "        for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
- "            if j in user_hist_items_:\n",
- "                continue\n",
- "            \n",
- "            # Weight by the difference in article creation time\n",
- "            created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
- "            # Weight by the position of the historical article in the click sequence (recent clicks count more)\n",
- "            loc_weight = (0.9 ** (len(user_hist_items) - loc))\n",
- "            \n",
- "            content_weight = 1.0\n",
- "            if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
- "                content_weight += emb_i2i_sim[i][j]\n",
- "            if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
- "                content_weight += emb_i2i_sim[j][i]\n",
- "                \n",
- "            item_rank.setdefault(j, 0)\n",
- "            item_rank[j] += created_time_weight * loc_weight * content_weight * wij\n",
- "    \n",
- "    # If fewer than recall_item_num articles were recalled, pad with popular articles\n",
- "    if len(item_rank) < recall_item_num:\n",
- "        for i, item in enumerate(item_topk_click):\n",
- "            if item in item_rank: # a padded item must not already be in the recall list\n",
- "                continue\n",
- "            item_rank[item] = - i - 100 # any negative score works, so padded items rank last\n",
- "            if len(item_rank) == recall_item_num:\n",
- "                break\n",
- "    \n",
- "    item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
- "        \n",
- "    return item_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### itemcf sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T14:41:23.433038Z",
- "start_time": "2020-11-16T11:48:46.286350Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [2:51:13<00:00, 24.33it/s] \n"
- ]
- }
- ],
- "source": [
- "# Run itemcf recall first; the last click is held out for recall evaluation\n",
- "\n",
- "if metric_recall:\n",
- "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- "    trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "\n",
- "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
- "emb_i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))\n",
- "\n",
- "sim_item_topk = 20\n",
- "recall_item_num = 10\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- "    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, \\\n",
- "                                                        i2i_sim, sim_item_topk, recall_item_num, \\\n",
- "                                                        item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- "\n",
- "user_multi_recall_dict['itemcf_sim_itemcf_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['itemcf_sim_itemcf_recall'], open(save_path + 'itemcf_recall_dict.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
- "    # Evaluate recall quality\n",
- "    metrics_recall(user_multi_recall_dict['itemcf_sim_itemcf_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### embedding sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T15:04:51.527795Z",
- "start_time": "2020-11-16T14:59:03.907519Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [04:35<00:00, 905.85it/s] \n"
- ]
- }
- ],
- "source": [
- "# The last click is held out here for recall evaluation\n",
- "if metric_recall:\n",
- "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- "    trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
- "\n",
- "sim_item_topk = 20\n",
- "recall_item_num = 10\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- "    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
- "                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- "    \n",
- "user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'], open(save_path + 'embedding_sim_item_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
- "    # Evaluate recall quality\n",
- "    metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
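- "### usercf recall\n",
- "\n",
- "User-based collaborative filtering: the core idea is to recommend to a user the articles historically clicked by similar users. Because the historical articles of similar users are involved, association rules can again be used to weight the articles a user may click. The rules used here mainly weight the relationship between a similar user's historically clicked articles and the target user's historically clicked articles. These relationships can borrow directly from item-based collaborative filtering, except that here the relationship weights are accumulated for each recommended item. The relationship weights used, and the related code, are given below:\n",
- "\n",
- "1. Compute the sums of the content similarity, the creation-time difference, and the relative position between the target user's historically clicked articles and the similar user's historically clicked articles, and use each sum as its own weight"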
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:32.293990Z",
- "start_time": "2020-11-17T02:09:32.278678Z"
- }
- },
- "outputs": [],
- "source": [
- "# User-based recall (u2u2i)\n",
- "def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num, \n",
- "                         item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
- "    \"\"\"\n",
- "        Recall based on user collaborative filtering\n",
- "        :param user_id: user id\n",
- "        :param user_item_time_dict: dict, each user's clicked articles ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
- "        :param u2u_sim: dict, user similarity matrix\n",
- "        :param sim_user_topk: int, number of users most similar to the current user to consider\n",
- "        :param recall_item_num: int, number of articles in the final recall list\n",
- "        :param item_topk_click: list of the most clicked articles, used to pad the recall list\n",
- "        :param item_created_time_dict: dict, article creation times\n",
- "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
- "        \n",
- "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",
- "    \"\"\"\n",
- "    # Interaction history\n",
- "    user_item_time_list = user_item_time_dict[user_id]    # [(item1, time1), (item2, time2)..]\n",
- "    user_hist_items = set([i for i, t in user_item_time_list])   # a user may interact with the same article several times, so deduplicate\n",
- "    \n",
- "    items_rank = {}\n",
- "    for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:\n",
- "        for i, click_time in user_item_time_dict[sim_u]:\n",
- "            if i in user_hist_items:\n",
- "                continue\n",
- "            items_rank.setdefault(i, 0)\n",
- "            \n",
- "            loc_weight = 1.0\n",
- "            content_weight = 1.0\n",
- "            created_time_weight = 1.0\n",
- "            \n",
- "            # Accumulate weights between the candidate article and each article in the target user's history\n",
- "            for loc, (j, click_time) in enumerate(user_item_time_list):\n",
- "                # Weight by relative position of the click\n",
- "                loc_weight += 0.9 ** (len(user_item_time_list) - loc)\n",
- "                # Weight by content similarity\n",
- "                if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
- "                    content_weight += emb_i2i_sim[i][j]\n",
- "                if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
- "                    content_weight += emb_i2i_sim[j][i]\n",
- "                \n",
- "                # Weight by creation-time difference; uses 0.8 ** diff, as in item_based_recommend, so closer creation times weigh more\n",
- "                created_time_weight += np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
- "                \n",
- "            items_rank[i] += loc_weight * content_weight * created_time_weight * wuv\n",
- "        \n",
- "    # Pad with popular articles\n",
- "    if len(items_rank) < recall_item_num:\n",
- "        for i, item in enumerate(item_topk_click):\n",
- "            if item in items_rank: # a padded item must not already be in the recall list\n",
- "                continue\n",
- "            items_rank[item] = - i - 100 # any negative score works, so padded items rank last\n",
- "            if len(items_rank) == recall_item_num:\n",
- "                break\n",
- "        \n",
- "    items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
- "    \n",
- "    return items_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### usercf sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:05:41.652501Z",
- "start_time": "2020-11-16T07:05:40.953871Z"
- }
- },
- "outputs": [],
- "source": [
- "# The last click is held out here for recall evaluation\n",
- "# Computing user-to-user similarity in usercf uses too much memory, so it was not run on the full data here, only on a sampled subset\n",
- "if metric_recall:\n",
- "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- "    trn_hist_click_df = all_click_df\n",
- "    \n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "\n",
- "u2u_sim = pickle.load(open(save_path + 'usercf_u2u_sim.pkl', 'rb'))\n",
- "\n",
- "sim_user_topk = 20\n",
- "recall_item_num = 10\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- "    user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
- "                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- "\n",
- "pickle.dump(user_recall_items_dict, open(save_path + 'usercf_u2u2i_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
- "    # Evaluate recall quality\n",
- "    metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T03:09:35.853516Z",
- "start_time": "2020-11-16T03:09:35.737625Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
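- "#### user embedding sim recall\n",
- "\n",
- "Although the usercf user-to-user similarity computation was not run directly, the user embeddings produced by YoutubeDNN are used below to validate the user-based collaborative filtering code: vector retrieval finds the topk most similar users for each user, and the resulting u2u similarity matrix is then used for usercf recall. The code is as follows:"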
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:46.807811Z",
- "start_time": "2020-11-17T02:09:46.798033Z"
- }
- },
- "outputs": [],
- "source": [
- "# Obtain the u2u similarity matrix from embeddings\n",
- "# topk: for each user, the number of most similar users returned by the faiss search\n",
- "def u2u_embdding_sim(click_df, user_emb_dict, save_path, topk):\n",
- "    \n",
- "    user_list = []\n",
- "    user_emb_list = []\n",
- "    for user_id, user_emb in user_emb_dict.items():\n",
- "        user_list.append(user_id)\n",
- "        user_emb_list.append(user_emb)\n",
- "        \n",
- "    user_index_2_rawid_dict = {k: v for k, v in zip(range(len(user_list)), user_list)}\n",
- "    \n",
- "    user_emb_np = np.array(user_emb_list, dtype=np.float32)\n",
- "    \n",
- "    # Build the faiss index\n",
- "    user_index = faiss.IndexFlatIP(user_emb_np.shape[1])\n",
- "    user_index.add(user_emb_np)\n",
- "    # Similarity query: for the vector at each index position, return the topk users and their similarities\n",
- "    sim, idx = user_index.search(user_emb_np, topk) # returns two arrays of shape (n, topk)\n",
- "    \n",
- "    # Store the retrieval results keyed by raw user id\n",
- "    user_sim_dict = collections.defaultdict(dict)\n",
- "    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(user_emb_np)), sim, idx)):\n",
- "        target_raw_id = user_index_2_rawid_dict[target_idx]\n",
- "        # Start from 1 to skip the user itself, so only topk-1 similar users are kept\n",
- "        for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]):\n",
- "            rele_raw_id = user_index_2_rawid_dict[rele_idx]\n",
- "            user_sim_dict[target_raw_id][rele_raw_id] = user_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
- "    \n",
- "    # Save the u2u similarity matrix\n",
- "    pickle.dump(user_sim_dict, open(save_path + 'youtube_u2u_sim.pkl', 'wb'))\n",
- "    return user_sim_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:14:31.355905Z",
- "start_time": "2020-11-17T02:09:53.236531Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "250000it [00:23, 10507.45it/s]\n"
- ]
- }
- ],
- "source": [
- "# Load the user embeddings produced by YoutubeDNN, then use faiss to compute user-to-user similarity\n",
- "# Note that the user embeddings obtained here are not very good, because YoutubeDNN trains them from user click sequences,\n",
- "# and the results are poor when the sequences are generally short\n",
- "user_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
- "u2u_sim = u2u_embdding_sim(all_click_df, user_emb_dict, save_path, topk=10)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "user_embedding obtained from YoutubeDNN"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:49:40.755431Z",
- "start_time": "2020-11-17T02:28:47.003514Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [19:43<00:00, 211.22it/s]\n"
- ]
- }
- ],
- "source": [
- "# Use the recall evaluation function to check the quality of the current recall method\n",
- "if metric_recall:\n",
- "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- "    trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "u2u_sim = pickle.load(open(save_path + 'youtube_u2u_sim.pkl', 'rb'))\n",
- "\n",
- "sim_user_topk = 20\n",
- "recall_item_num = 10\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- "    user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
- "                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- "    \n",
- "user_multi_recall_dict['youtubednn_usercf_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['youtubednn_usercf_recall'], open(save_path + 'youtubednn_usercf_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
- "    # Evaluate recall quality\n",
- "    metrics_recall(user_multi_recall_dict['youtubednn_usercf_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:07:44.326253Z",
- "start_time": "2020-11-16T07:07:43.798931Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Cold start problem"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
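- "**The cold start problem falls into three categories: article cold start, user cold start, and system cold start.**\n",
- "\n",
- "- Article cold start: how to recommend to users an article newly added to the platform that has no interaction records. (In our scenario, any article that never appears in the log data can be regarded as a cold-start article.)\n",
- "- User cold start: how to recommend to a user who is new to the platform and has no article interactions yet. (In our scenario, this means checking whether a test-set user appears in the log data corresponding to the test set; if not, the user can be regarded as a cold-start user. The criterion is not always this strict: we can also define our own indicators for identifying cold-start users, such as usage time, click-through rate, retention rate, and so on.)\n",
- "- System cold start: a platform that has just launched and has no historical data at all; this is effectively a combination of the previous two.\n",
- "\n",
- "**Analysis of the cold start problem in the current scenario:**\n",
- "\n",
- "Analyzing the current data shows that only a little over 30,000 distinct articles are ever clicked in the logs, while the whole article library holds more than 300,000. Will the last click of a test-set user land on an article that never appears in the logs? If so, the clicked article has no prior interaction information, which is exactly article cold start. The data analysis also shows that a considerable share of test-set users have only a single click, and recommending articles with a model from just one click is difficult, so user cold start could be considered here as well. However, only solutions and code for item cold start are given here; for user cold start, some feasible approaches are suggested.\n",
- "\n",
- "1. Article cold start (without the cold-start exploration problem)  \n",
- "    The goal here is not article cold start for its own sake; rather, we suspect that users may click articles that do not appear in the log data. The task is to choose, from nearly 270,000 articles, some to serve as cold-start articles for each user. This can also be seen as a recall strategy, and a simple, easy-to-understand rule-based recall strategy is used here to obtain articles the user may click that do not appear in the log data.  \n",
- "    The question then becomes: how do we pick a small subset of the 270,000 items for each user? Random selection is one option. Some reference approaches are given below.\n",
- "    1. First, recall some articles similar to the user's history based on embeddings\n",
- "    2. From the embedding-recalled articles, filter some out by rules so that the remaining articles are more likely to be clicked. The rules can be, for example, keeping articles with the same topic as the user's historically clicked articles, or articles with a similar word count. The remaining articles should also be as close as possible in time to the test-set user's last click, or from the same day.\n",
- "2. User cold start  \n",
- "    Analyzing the click data of test-set users shows that 20 percent of them have only one click. Could the recall for these users with very few clicks be supplemented with separate strategies? Or could some articles be added directly by rule after ranking? Both are worth trying; no concrete implementation is provided here.\n",
- "    \n",
- "**Note:**  \n",
- "\n",
- "This looks the same as computing item-to-item similarity from embeddings and then running itemcf, but the purpose is different: here we want items with similar vectors that have not yet appeared in the logs, combined with some other cold-start strategies. The number of recalled items should be somewhat larger here; otherwise no articles may be left after filtering."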
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T04:30:23.027164Z",
- "start_time": "2020-11-17T04:23:09.960235Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [05:01<00:00, 828.60it/s] \n"
- ]
- }
- ],
- "source": [
- "# Run itemcf recall first; no recall evaluation is needed here, since this is only a strategy\n",
- "trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
- "\n",
- "sim_item_topk = 150\n",
- "recall_item_num = 100 # recall somewhat more articles to leave room for the rule-based filtering that follows\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- "    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
- "                                                        recall_item_num, item_topk_click,item_created_time_dict, emb_i2i_sim)\n",
- "pickle.dump(user_recall_items_dict, open(save_path + 'cold_start_items_raw_dict.pkl', 'wb'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:11:39.267581Z",
- "start_time": "2020-11-17T06:11:39.252563Z"
- }
- },
- "outputs": [],
- "source": [
- "# 基于规则进行文章过滤\n",
- "# 保留文章主题与用户历史浏览主题相似的文章\n",
- "# 保留文章字数与用户历史浏览文章字数相差不大的文章\n",
- "# 保留最后一次点击当天的文章\n",
- "# 按照相似度返回最终的结果\n",
- "\n",
- "def get_click_article_ids_set(all_click_df):\n",
- " return set(all_click_df.click_article_id.values)\n",
- "\n",
- "def cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
- " user_last_item_created_time_dict, item_type_dict, item_words_dict, \n",
- " item_created_time_dict, click_article_ids_set, recall_item_num):\n",
- " \"\"\"\n",
- " 冷启动的情况下召回一些文章\n",
- " :param user_recall_items_dict: 基于内容embedding相似性召回来的很多文章, 字典, {user1: [(item1, item2), ..], }\n",
- " :param user_hist_item_typs_dict: 字典, 用户点击的文章的主题映射\n",
- " :param user_hist_item_words_dict: 字典, 用户点击的历史文章的字数映射\n",
- " :param user_last_item_created_time_idct: 字典,用户点击的历史文章创建时间映射\n",
- " :param item_tpye_idct: 字典,文章主题映射\n",
- " :param item_words_dict: 字典,文章字数映射\n",
- " :param item_created_time_dict: 字典, 文章创建时间映射\n",
- " :param click_article_ids_set: 集合,用户点击过得文章, 也就是日志里面出现过的文章\n",
- " :param recall_item_num: 召回文章的数量, 这个指的是没有出现在日志里面的文章数量\n",
- " \"\"\"\n",
- " \n",
- " cold_start_user_items_dict = {}\n",
- " for user, item_list in tqdm(user_recall_items_dict.items()):\n",
- " cold_start_user_items_dict.setdefault(user, [])\n",
- " for item, score in item_list:\n",
-    "            # Look up the user's historical article information\n",
- " hist_item_type_set = user_hist_item_typs_dict[user]\n",
- " hist_mean_words = user_hist_item_words_dict[user]\n",
- " hist_last_item_created_time = user_last_item_created_time_dict[user]\n",
- " hist_last_item_created_time = datetime.fromtimestamp(hist_last_item_created_time)\n",
- " \n",
-    "            # Look up the information of the currently recalled article\n",
- " curr_item_type = item_type_dict[item]\n",
- " curr_item_words = item_words_dict[item]\n",
- " curr_item_created_time = item_created_time_dict[item]\n",
- " curr_item_created_time = datetime.fromtimestamp(curr_item_created_time)\n",
- "\n",
-    "            # Skip the article if it already appears in the click log, or if its topic, word count, or creation time is too far from the user's history\n",
- " if curr_item_type not in hist_item_type_set or \\\n",
- " item in click_article_ids_set or \\\n",
- " abs(curr_item_words - hist_mean_words) > 200 or \\\n",
- " abs((curr_item_created_time - hist_last_item_created_time).days) > 90: \n",
- " continue\n",
- " \n",
- " cold_start_user_items_dict[user].append((item, score)) # {user1: [(item1, score1), (item2, score2)..]...}\n",
- " \n",
-    "    # Cap the number of cold-start articles recalled per user\n",
- " cold_start_user_items_dict = {k: sorted(v, key=lambda x:x[1], reverse=True)[:recall_item_num] \\\n",
- " for k, v in cold_start_user_items_dict.items()}\n",
- " \n",
- " pickle.dump(cold_start_user_items_dict, open(save_path + 'cold_start_user_items_dict.pkl', 'wb'))\n",
- " \n",
- " return cold_start_user_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:35:38.758278Z",
- "start_time": "2020-11-17T06:31:40.164332Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [01:49<00:00, 2293.37it/s]\n"
- ]
- }
- ],
- "source": [
- "all_click_df_ = all_click_df.copy()\n",
- "all_click_df_ = all_click_df_.merge(item_info_df, how='left', on='click_article_id')\n",
- "user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict = get_user_hist_item_info_dict(all_click_df_)\n",
- "click_article_ids_set = get_click_article_ids_set(all_click_df)\n",
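-    "# Note:\n",
-    "# many rules are used here to filter the cold-start articles, so the earlier recall stage should recall as many articles as possible; otherwise most candidates are easily filtered out\n",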
- "cold_start_user_items_dict = cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
- " user_last_item_created_time_dict, item_type_dict, item_words_dict, \\\n",
- " item_created_time_dict, click_article_ids_set, recall_item_num)\n",
- "\n",
- "user_multi_recall_dict['cold_start_recall'] = cold_start_user_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:13:33.099298Z",
- "start_time": "2020-11-16T07:13:32.655036Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
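-    "## Merging the multi-channel recall results\n",
-    "Merging combines the per-user article lists produced by all of the recall strategies above. The recall channels so far are:\n",
-    "1. Recall based on the item-to-item similarity computed by itemcf  \n",
-    "2. Recall based on the item-to-item similarity found by embedding search\n",
-    "3. YoutubeDNN recall\n",
-    "4. Recall based on the user-to-user similarity derived from YoutubeDNN\n",
-    "5. Recall based on the cold-start strategy\n",
-    "\n",
-    "**Note:**  \n",
-    "When evaluating recall, some channels turn out to work well and others poorly, so we can manually assign a weight to each channel when fusing the similarity scores."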
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:02:16.033971Z",
- "start_time": "2020-11-17T07:02:16.019819Z"
- }
- },
- "outputs": [],
- "source": [
- "def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):\n",
- " final_recall_items_dict = {}\n",
- " \n",
-    "    # Normalize each channel's scores per user so that scores for the same user and item can be summed across channels\n",
- " def norm_user_recall_items_sim(sorted_item_list):\n",
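-    "        # If the list is empty or holds a single article, return it unchanged. This can happen when cold-start recall\n",
-    "        # returns too few articles and the rule-based filtering removes them; other fallback strategies could be applied here\n",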
- " if len(sorted_item_list) < 2:\n",
- " return sorted_item_list\n",
- " \n",
- " min_sim = sorted_item_list[-1][1]\n",
- " max_sim = sorted_item_list[0][1]\n",
- " \n",
- " norm_sorted_item_list = []\n",
- " for item, score in sorted_item_list:\n",
- " if max_sim > 0:\n",
- " norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0\n",
- " else:\n",
- " norm_score = 0.0\n",
- " norm_sorted_item_list.append((item, norm_score))\n",
- " \n",
- " return norm_sorted_item_list\n",
- " \n",
- " print('多路召回合并...')\n",
- " for method, user_recall_items in tqdm(user_multi_recall_dict.items()):\n",
- " print(method + '...')\n",
-    "        # A weight can also be assigned to each recall channel when computing the final result\n",
-    "        if weight_dict is None:\n",
- " recall_method_weight = 1\n",
- " else:\n",
- " recall_method_weight = weight_dict[method]\n",
- " \n",
-    "        for user_id, sorted_item_list in user_recall_items.items():  # normalize the scores per user\n",
- " user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list)\n",
- " \n",
- " for user_id, sorted_item_list in user_recall_items.items():\n",
- " # print('user_id')\n",
- " final_recall_items_dict.setdefault(user_id, {})\n",
- " for item, score in sorted_item_list:\n",
- " final_recall_items_dict[user_id].setdefault(item, 0)\n",
- " final_recall_items_dict[user_id][item] += recall_method_weight * score \n",
- " \n",
- " final_recall_items_dict_rank = {}\n",
-    "    # The final number of recalled items can also be capped when merging\n",
- " for user, recall_item_dict in final_recall_items_dict.items():\n",
- " final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk]\n",
- "\n",
-    "    # Save the merged multi-channel recall dictionary to disk\n",
- " pickle.dump(final_recall_items_dict_rank, open(os.path.join(save_path, 'final_recall_items_dict.pkl'),'wb'))\n",
- "\n",
- " return final_recall_items_dict_rank"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:02:21.078455Z",
- "start_time": "2020-11-17T07:02:21.074060Z"
- }
- },
- "outputs": [],
- "source": [
-    "# All channels get the same weight here; the values can be tuned according to how each channel performed in recall evaluation\n",
- "weight_dict = {'itemcf_sim_itemcf_recall': 1.0,\n",
- " 'embedding_sim_item_recall': 1.0,\n",
- " 'youtubednn_recall': 1.0,\n",
- " 'youtubednn_usercf_recall': 1.0, \n",
- " 'cold_start_recall': 1.0}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:04:35.747924Z",
- "start_time": "2020-11-17T07:02:26.889573Z"
- }
- },
- "outputs": [
+ "cells": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 0%| | 0/5 [00:00, ?it/s]"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
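+    "# Multi-channel recall\n",
+    "\n",
+    "A multi-channel recall strategy uses different strategies, features, or simple models to each recall part of the candidate set, then mixes the candidates together for the downstream ranking model. It is a trade-off between computation speed and recall rate: the individual simple strategies keep candidate recall fast, while designing the strategies from different angles keeps the recall rate close to ideal so that ranking quality does not suffer. The figure below is a schematic of multi-channel recall. Because the strategies are independent of one another, they can usually be run concurrently in multiple threads for better efficiency.\n",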
+ "\n",
+    "(Figure: schematic of multi-channel recall)\n",
+ "\n",
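+    "The figure above is only one example of multi-channel recall: several different strategies can be used to obtain the candidate item set for ranking. Which strategies to use depends strongly on the business, and each task has recall rules that reflect its real scenario. For news recommendation, the recall rules could be hot news, author-based recall, keyword recall, topic recall, collaborative-filtering recall, and so on.  \n",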
+ "\n"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "多路召回合并...\n",
- "itemcf_sim_itemcf_recall...\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Imports"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 20%|██ | 1/5 [00:08<00:34, 8.66s/it]"
- ]
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:29.834662Z",
+ "start_time": "2020-11-16T11:26:27.811511Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd \n",
+ "import numpy as np\n",
+ "from tqdm import tqdm \n",
+ "from collections import defaultdict \n",
+    "import os, math, warnings, pickle\n",
+ "import faiss\n",
+ "import collections\n",
+ "import random\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from datetime import datetime\n",
+ "from deepctr.feature_column import SparseFeat, VarLenSparseFeat\n",
+ "from tensorflow.python.keras import backend as K\n",
+ "from tensorflow.python.keras.models import Model\n",
+ "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
+ "\n",
+ "from deepmatch.models import *\n",
+ "from deepmatch.utils import sampledsoftmaxloss\n",
+ "warnings.filterwarnings('ignore')"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "embedding_sim_item_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:31.831215Z",
+ "start_time": "2020-11-16T11:26:31.826939Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'\n",
+    "# Flag for recall evaluation; when False, recall runs directly on the full data\n",
+ "metric_recall = False"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 40%|████ | 2/5 [00:16<00:24, 8.29s/it]"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
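+    "## Loading the data\n",
+    "In a typical recommender-system competition, data loading falls into three modes, each using a different dataset:\n",
+    "1. Debug mode: the goal is to build a simple baseline on the data and get it running end to end, to make sure the baseline code has no problems. Competition data is often very large, and analyzing all of it while building the baseline framework costs time and hardware, **so we usually draw a random sample from the huge training set for debugging (train_click_log_sample)** and get a baseline running first.\n",
+    "2. Offline validation mode: the goal is to choose a suitable model and hyperparameters offline using the existing training data. **Here we only load the whole training set (train_click_log)** and split it into a training part and a validation part. The training part is used to fit the model, and the validation part helps tune the model parameters and other hyperparameters.\n",
+    "3. Online mode: after building a baseline in debug mode and choosing the model and hyperparameters in offline validation mode, this is the stage that actually predicts on the given test set and submits online, **so the training data here is the full dataset (train_click_log+test_click_log)**\n",
+    "\n",
+    "Below we define a separate data-loading function for each of these modes, so the data can be loaded conveniently in any of them."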
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "youtubednn_recall...\n",
- "youtubednn_usercf_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:34.476240Z",
+ "start_time": "2020-11-16T11:26:34.467352Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Debug mode: sample part of the training set to debug the code\n",
+ "def get_all_click_sample(data_path, sample_nums=10000):\n",
+ " \"\"\"\n",
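+    "        Sample part of the training set for debugging\n",
+    "        data_path: path where the raw data is stored\n",
+    "        sample_nums: number of users to sample (sampling by user keeps memory use manageable)\n",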
+ " \"\"\"\n",
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " all_user_ids = all_click.user_id.unique()\n",
+ "\n",
+ " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
+ " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click\n",
+ "\n",
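+    "# Load the click data, in either online or offline mode. To produce an online submission, the test-set clicks should be merged into the full data\n",
+    "# To validate a model or a feature offline, the training set alone is enough\n",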
+ "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
+ " if offline:\n",
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " else:\n",
+ " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "\n",
+    "        all_click = pd.concat([trn_click, tst_click])\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 80%|████████ | 4/5 [00:23<00:06, 6.98s/it]"
- ]
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:35.168738Z",
+ "start_time": "2020-11-16T11:26:35.163210Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Load the basic article attributes\n",
+ "def get_item_info_df(data_path):\n",
+ " item_info_df = pd.read_csv(data_path + 'articles.csv')\n",
+ " \n",
+    "    # Rename article_id to click_article_id so it can be joined with click_article_id in the training set\n",
+ " item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'})\n",
+ " \n",
+ " return item_info_df"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "cold_start_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:36.152958Z",
+ "start_time": "2020-11-16T11:26:36.146324Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Load the article embedding data\n",
+ "def get_item_emb_dict(data_path):\n",
+ " item_emb_df = pd.read_csv(data_path + 'articles_emb.csv')\n",
+ " \n",
+ " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
+ " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols])\n",
+    "    # L2-normalize each embedding vector\n",
+ " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
+ "\n",
+ " item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np))\n",
+ " pickle.dump(item_emb_dict, open(save_path + 'item_content_emb.pkl', 'wb'))\n",
+ " \n",
+ " return item_emb_dict"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 5/5 [00:42<00:00, 8.40s/it]\n"
- ]
- }
- ],
- "source": [
-    "# After the final merge, keep 150 recalled items per user for ranking\n",
- "final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=150)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
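-    "## Summary\n",
-    "\n",
-    "The following recall strategies were implemented above:\n",
-    "\n",
-    "1. itemcf based on association rules\n",
-    "2. usercf based on association rules\n",
-    "3. youtubednn recall\n",
-    "4. Cold-start recall\n",
-    "\n",
-    "None of these recall strategies is tuned to its best; they are only simple first attempts, and much can still be improved: the parameters of the strategies already implemented, or the association rules themselves, which can be extended or changed. More recall strategies are also worth trying, such as recalling news by popularity.\n",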
- "\n",
- "\n",
- "\n",
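-    "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision \"for the learner, growing together with learners\", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, willingness to experiment, and taking responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data-mining learning path will be shared on Tianchi; follow Datawhale for details:\n",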
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "nbTranslate": {
- "displayLangs": [
- "*"
- ],
- "hotkey": "alt-t",
- "langInMainMenu": true,
- "sourceLang": "en",
- "targetLang": "fr",
- "useGoogleTranslate": true
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:37.333536Z",
+ "start_time": "2020-11-16T11:26:37.329545Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:42.163494Z",
+ "start_time": "2020-11-16T11:26:38.018094Z"
+ }
+ },
+ "outputs": [],
+ "source": [
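+    "# Sampled data\n",
+    "# all_click_df = get_all_click_sample(data_path)\n",
+    "\n",
+    "# Full data (offline=False also merges in the test-set clicks)\n",
+    "all_click_df = get_all_click_df(offline=False)\n",
+    "\n",
+    "# Min-max normalize the timestamps; they are used to compute weights in the association rules\n",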
+ "all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:44.343500Z",
+ "start_time": "2020-11-16T11:26:44.113891Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_info_df = get_item_info_df(data_path)"
+ ]
+ },
{
- "id": "83580",
- "title": "零基础入门推荐系统 - 新闻推荐"
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:24.295343Z",
+ "start_time": "2020-11-16T11:26:44.398007Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_emb_dict = get_item_emb_dict(data_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Utility functions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Get the user-article-time mapping\n",
+    "This is used by the association-rule-based item collaborative filtering (itemcf_sim calls it)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:33.791656Z",
+ "start_time": "2020-11-16T11:27:33.784305Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Build each user's click sequence ordered by click time {user1: [(item1, time1), (item2, time2)..]...}\n",
+ "def get_user_item_time(click_df):\n",
+ " \n",
+ " click_df = click_df.sort_values('click_timestamp')\n",
+ " \n",
+ " def make_item_time_pair(df):\n",
+ " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
+ " \n",
+    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
+ " .reset_index().rename(columns={0: 'item_time_list'})\n",
+ " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
+ " \n",
+ " return user_item_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Get the article-user-time mapping\n",
+    "This is used by the association-rule-based user collaborative filtering (usercf_sim calls it)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:38.327581Z",
+ "start_time": "2020-11-16T11:27:38.321059Z"
+ }
+ },
+ "outputs": [],
+ "source": [
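+    "# Build, for each item, the sequence of users who clicked it, ordered by time {item1: [(user1, time1), (user2, time2)...]...}\n",
+    "# The time stored here is when the user clicked the item; it does not seem to matter directly for what follows.\n",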
+ "def get_item_user_time_dict(click_df):\n",
+ " def make_user_time_pair(df):\n",
+ " return list(zip(df['user_id'], df['click_timestamp']))\n",
+ " \n",
+ " click_df = click_df.sort_values('click_timestamp')\n",
+    "    item_user_time_df = click_df.groupby('click_article_id')[['user_id', 'click_timestamp']].apply(lambda x: make_user_time_pair(x))\\\n",
+ " .reset_index().rename(columns={0: 'user_time_list'})\n",
+ " \n",
+ " item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))\n",
+ " return item_user_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
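+    "### Get the click history and the last click\n",
+    "This is used when evaluating recall results, in feature engineering, and when building labels to turn the data into a supervised-learning test set"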
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:50.894683Z",
+ "start_time": "2020-11-16T11:27:50.888002Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Split the data into each user's historical clicks and last click\n",
+ "def get_hist_and_last_click(all_click):\n",
+ " \n",
+ " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
+ " click_last_df = all_click.groupby('user_id').tail(1)\n",
+ "\n",
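+    "    # If a user has only one click, hist would be empty and the user would be invisible during training, so in that case the single click is kept (leaked) in hist\n",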
+ " def hist_func(user_df):\n",
+ " if len(user_df) == 1:\n",
+ " return user_df\n",
+ " else:\n",
+ " return user_df[:-1]\n",
+ "\n",
+ " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
+ "\n",
+ " return click_hist_df, click_last_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Get the article attribute features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:55.893810Z",
+ "start_time": "2020-11-16T11:27:55.887623Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Map each article id to its basic attributes and store them as dicts, so the recall and cold-start stages can use them directly\n",
+ "def get_item_info_dict(item_info_df):\n",
+ " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
+ " item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler)\n",
+ " \n",
+ " item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id']))\n",
+ " item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count']))\n",
+ " item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts']))\n",
+ " \n",
+ " return item_type_dict, item_words_dict, item_created_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T06:42:38.730939Z",
+ "start_time": "2020-11-13T06:42:38.728461Z"
+ }
+ },
+ "source": [
+    "### Get information about the articles each user clicked"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:59.650781Z",
+ "start_time": "2020-11-16T11:27:59.640572Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_user_hist_item_info_dict(all_click):\n",
+ " \n",
+    "    # Dict: user_id -> set of topics of the articles the user clicked\n",
+ " user_hist_item_typs = all_click.groupby('user_id')['category_id'].agg(set).reset_index()\n",
+ " user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id']))\n",
+ " \n",
+    "    # Dict: user_id -> set of articles the user clicked\n",
+ " user_hist_item_ids_dict = all_click.groupby('user_id')['click_article_id'].agg(set).reset_index()\n",
+ " user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id']))\n",
+ " \n",
+    "    # Dict: user_id -> mean word count of the articles the user clicked\n",
+ " user_hist_item_words = all_click.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
+ " user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count']))\n",
+ " \n",
+    "    # Creation time of the last article each user clicked\n",
+ " all_click_ = all_click.sort_values('click_timestamp')\n",
+ " user_last_item_created_time = all_click_.groupby('user_id')['created_at_ts'].apply(lambda x: x.iloc[-1]).reset_index()\n",
+ " \n",
+ " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
+ " user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler)\n",
+ " \n",
+ " user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'], \\\n",
+ " user_last_item_created_time['created_at_ts']))\n",
+ " \n",
+ " return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Get the top-k most-clicked articles"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:04.761105Z",
+ "start_time": "2020-11-16T11:28:04.756419Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Get the k most-clicked articles in the given click data\n",
+ "def get_item_topk_click(click_df, k):\n",
+ " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
+ " return topk_click"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Define the multi-channel recall dictionary"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:08.321506Z",
+ "start_time": "2020-11-16T11:28:07.623281Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Get the article attributes and store them as dicts for easy lookup\n",
+ "item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:13.791569Z",
+ "start_time": "2020-11-16T11:28:13.786522Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the multi-channel recall dictionary; the results of every recall channel are stored in it\n",
+ "user_multi_recall_dict = {'itemcf_sim_itemcf_recall': {},\n",
+ " 'embedding_sim_item_recall': {},\n",
+ " 'youtubednn_recall': {},\n",
+ " 'youtubednn_usercf_recall': {}, \n",
+ " 'cold_start_recall': {}}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T05:41:12.710754Z",
+ "start_time": "2020-11-16T05:40:57.842614Z"
+ }
+ },
+ "outputs": [],
+ "source": [
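+    "# Hold out each user's last click for recall evaluation (offline validation)\n",
+    "# When recall is not being evaluated, recall on the full data directly; the last click does not need to be held out\n",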
+ "trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
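+    "## Recall evaluation function\n",
+    "After recall, the current recall method or its parameters sometimes need tuning for better results, because the recall results set the upper bound on the final ranking. An evaluation method for recall is provided below"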
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T05:41:18.579118Z",
+ "start_time": "2020-11-16T05:41:18.571887Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Evaluate the hit rate within the top 10, 20, 30, 40, and 50 recalled articles in turn\n",
+    "def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=50):\n",
+ " last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))\n",
+ " user_num = len(user_recall_items_dict)\n",
+ " \n",
+ " for k in range(10, topk+1, 10):\n",
+ " hit_num = 0\n",
+ " for user, item_list in user_recall_items_dict.items():\n",
+    "            # Take the top-k recalled items\n",
+ " tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]\n",
+ " if last_click_item_dict[user] in set(tmp_recall_items):\n",
+ " hit_num += 1\n",
+ " \n",
+ " hit_rate = round(hit_num * 1.0 / user_num, 5)\n",
+ " print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
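+    "## Computing the similarity matrices\n",
+    "\n",
+    "This part builds similarity matrices through collaborative filtering and vector search. They come in two kinds, user2user and item2item. We compute them in turn below, starting with the itemcf-based item2item similarity matrix."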
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### itemcf i2i_sim\n",
+ "\n",
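+    "Borrowing from the debiased product recommendation work at KDD 2020, we apply association rules when computing the item2item similarity matrix, so that article similarity also takes into account:\n",
+    "1. A weight for the time of the user's clicks\n",
+    "2. A weight for the order of the user's clicks\n",
+    "3. A weight for the articles' creation time"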
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:30:51.872262Z",
+ "start_time": "2020-11-16T11:30:51.860099Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def itemcf_sim(df, item_created_time_dict):\n",
+ " \"\"\"\n",
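+    "        Compute the article-to-article similarity matrix\n",
+    "        :param df: the click data table\n",
+    "        :param item_created_time_dict: dict of article creation times\n",
+    "        return : the article-to-article similarity matrix\n",
+    "        \n",
+    "        Idea: item-based collaborative filtering (see the previous team-study course on recommender-system basics for details) + association rules\n",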
+ " \"\"\"\n",
+ " \n",
+ " user_item_time_dict = get_user_item_time(df)\n",
+ " \n",
+    "    # Compute item similarity\n",
+ " i2i_sim = {}\n",
+ " item_cnt = defaultdict(int)\n",
+ " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
+    "        # Time factors can be taken into account when refining item-based collaborative filtering\n",
+ " for loc1, (i, i_click_time) in enumerate(item_time_list):\n",
+ " item_cnt[i] += 1\n",
+ " i2i_sim.setdefault(i, {})\n",
+ " for loc2, (j, j_click_time) in enumerate(item_time_list):\n",
+ " if(i == j):\n",
+ " continue\n",
+ " \n",
+    "                # Distinguish forward-order clicks (j clicked after i) from reverse-order clicks\n",
+ " loc_alpha = 1.0 if loc2 > loc1 else 0.7\n",
+    "                # Position weight; the parameters are tunable\n",
+ " loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))\n",
+    "                # Click-time weight; the parameters are tunable\n",
+ " click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))\n",
+    "                # Weight for the gap between the two articles' creation times; the parameters are tunable\n",
+ " created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
+ " i2i_sim[i].setdefault(j, 0)\n",
+    "                # Combine all the weights into the final article-to-article similarity\n",
+ " i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)\n",
+ " \n",
+ " i2i_sim_ = i2i_sim.copy()\n",
+ " for i, related_items in i2i_sim.items():\n",
+ " for j, wij in related_items.items():\n",
+ " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
+ " \n",
+    "    # Save the similarity matrix to disk\n",
+ " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
+ " \n",
+ " return i2i_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:47:09.937002Z",
+ "start_time": "2020-11-16T11:30:57.394334Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [14:20<00:00, 290.38it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### usercf u2u_sim\n",
+ "\n",
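+    "When computing user-to-user similarity we can also apply some simple association rules, such as a user-activity weight. Here a user's click count serves as the measure of activity"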
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:14.951940Z",
+ "start_time": "2020-11-16T09:11:14.945654Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_user_activate_degree_dict(all_click_df):\n",
+ " all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index()\n",
+ " \n",
+    "    # Min-max normalize the user activity\n",
+ " mm = MinMaxScaler()\n",
+ " all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']])\n",
+ " user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id']))\n",
+ " \n",
+ " return user_activate_degree_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:19.879276Z",
+ "start_time": "2020-11-16T09:11:19.868808Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def usercf_sim(all_click_df, user_activate_degree_dict):\n",
+ " \"\"\"\n",
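+    "        Compute the user similarity matrix\n",
+    "        :param all_click_df: the click data table\n",
+    "        :param user_activate_degree_dict: dict of user activity levels\n",
+    "        return the user similarity matrix\n",
+    "        \n",
+    "        Idea: user-based collaborative filtering (see the previous team-study course on recommender-system basics for details) + association rules\n",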
+ " \"\"\"\n",
+ " item_user_time_dict = get_item_user_time_dict(all_click_df)\n",
+ " \n",
+ " u2u_sim = {}\n",
+ " user_cnt = defaultdict(int)\n",
+ " for item, user_time_list in tqdm(item_user_time_dict.items()):\n",
+ " for u, click_time in user_time_list:\n",
+ " user_cnt[u] += 1\n",
+ " u2u_sim.setdefault(u, {})\n",
+ " for v, click_time in user_time_list:\n",
+ " u2u_sim[u].setdefault(v, 0)\n",
+ " if u == v:\n",
+ " continue\n",
+    "                # Use the two users' mean activity as the activity weight; this formula could be refined\n",
+ " activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v]) \n",
+ " u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1)\n",
+ " \n",
+ " u2u_sim_ = u2u_sim.copy()\n",
+ " for u, related_users in u2u_sim.items():\n",
+ " for v, wij in related_users.items():\n",
+ " u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v])\n",
+ " \n",
+    "    # Save the similarity matrix to disk\n",
+ " pickle.dump(u2u_sim_, open(save_path + 'usercf_u2u_sim.pkl', 'wb'))\n",
+ "\n",
+ " return u2u_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T06:59:46.701572Z",
+ "start_time": "2020-11-16T06:59:26.852246Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# usercf uses too much memory on the full data, so it is not run directly here\n",
+    "# It does run on the sampled data\n",
+ "user_activate_degree_dict = get_user_activate_degree_dict(all_click_df)\n",
+ "u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### item embedding sim\n",
+ "\n",
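+    "Embedding-based item similarity is computed so that the later cold-start step can retrieve articles that never appear in the click data. Cold start has its own section later; here is a brief introduction to faiss.\n",
+    "\n",
+    "Faiss is a library for clustering and similarity search open-sourced by Facebook's AI team, implemented in C++ underneath. Because of its very strong performance, Faiss is widely used in recommendation-related workloads.\n",
+    "\n",
+    "The faiss toolkit is typically used in the vector recall part of a recommender system. Vector recall is u2u, u2i, or i2i, where u and i stand for user and item. In real scenarios the numbers of users and items are huge. The most obvious way to recall by vector similarity is a double loop over the user list or item list that computes the similarity of every pair of vectors, but that is impractical on massive data. faiss speeds up finding the topk indexed vectors most similar to a query vector.\n",
+    "\n",
+    "**How faiss search works:**\n",
+    "\n",
+    "faiss uses two techniques, PCA and PQ (product quantization), to compress and encode vectors. Other optimizations are used as well, but PCA and PQ are the core.\n",
+    "\n",
+    "1. For the details of PCA dimensionality reduction, see this link  \n",
+    "[Principal component analysis (PCA): a summary of the principles](https://www.cnblogs.com/pinard/p/6239403.html)  \n",
+    "\n",
+    "2. For the details of PQ encoding, see this link  \n",
+    "[Understanding the product quantization algorithm by example](http://www.fabwrite.com/productquantization)\n",
+    "\n",
+    "**Using faiss**\n",
+    "\n",
+    "[Official faiss tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)"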
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:28.631803Z",
+ "start_time": "2020-11-16T09:11:28.619926Z"
+ }
+ },
+ "outputs": [],
+ "source": [
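+    "# Similarity computation via vector search\n",
+    "# topk: for each item, Faiss returns the topk most similar items\n",
+    "def embdding_sim(click_df, item_emb_df, save_path, topk):\n",
+    "    \"\"\"\n",
+    "        Compute the content-based article embedding similarity matrix\n",
+    "        :param click_df: click data table\n",
+    "        :param item_emb_df: article embeddings\n",
+    "        :param save_path: path to save the result\n",
+    "        :param topk: number of most similar articles to retrieve\n",
+    "        :return: article similarity matrix\n",
+    "        \n",
+    "        Idea: for each article, return the topk most similar articles by embedding similarity; because there are so many articles, Faiss is used to speed this up\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    # Map article row index to raw article id\n",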
+ " item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id']))\n",
+ " \n",
+ " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
+ " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32)\n",
+    "    # Normalize the vectors to unit length\n",
+ " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
+ " \n",
+    "    # Build the Faiss index\n",
+ " item_index = faiss.IndexFlatIP(item_emb_np.shape[1])\n",
+ " item_index.add(item_emb_np)\n",
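+    "    # Similarity query: for each indexed vector, return its topk items and their similarities\n",
+    "    sim, idx = item_index.search(item_emb_np, topk) # both results are arrays of shape (n_items, topk)\n",
+    "    \n",
+    "    # Store the search results keyed by raw article id\n",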
+ " item_sim_dict = collections.defaultdict(dict)\n",
+ " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)):\n",
+ " target_raw_id = item_idx_2_rawid_dict[target_idx]\n",
+    "        # Start from 1 to skip the article itself, so only topk-1 similar articles are kept\n",
+ " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
+ " rele_raw_id = item_idx_2_rawid_dict[rele_idx]\n",
+ " item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
+ " \n",
+    "    # Save the i2i similarity matrix\n",
+ " pickle.dump(item_sim_dict, open(save_path + 'emb_i2i_sim.pkl', 'wb')) \n",
+ " \n",
+ " return item_sim_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:32:35.926116Z",
+ "start_time": "2020-11-16T09:11:44.586967Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "364047it [00:23, 15292.14it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "item_emb_df = pd.read_csv(data_path + '/articles_emb.csv')\n",
+    "emb_i2i_sim = embdding_sim(all_click_df, item_emb_df, save_path, topk=10) # topk can be tuned"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
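+    "## Recall\n",
+    "This is the question raised at the start: facing 360,000 articles and more than 200,000 users to recommend for, what strategies can shrink the problem? In the recall stage we select, for each user, a candidate set of articles they might click, which reduces the scale of the problem. Common recall strategies:\n",
+    "* Youtube DNN recall\n",
+    "* Article-based recall\n",
+    "    * Article collaborative filtering\n",
+    "    * Recall based on article embeddings\n",
+    "* User-based recall\n",
+    "    * User collaborative filtering\n",
+    "    * User embeddings\n",
+    "\n",
+    "Some of these methods recall articles similar to the ones a user has already read; different ways of computing that similarity give different recall methods, such as article collaborative filtering and article content embeddings. Others recommend by user similarity: a user is recommended articles read by similar users, as in user collaborative filtering and user embeddings. A third approach resembles matrix factorization: once user and article embeddings are computed, user-article similarity can be computed directly and used for recommendation, as in YouTube DNN. Each recall method is covered in detail below:"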
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### YoutubeDNN recall\n",
+    "**(This step directly produces each user's recalled candidate article list)**\n",
+    "\n",
+    "[Paper download link](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)\n",
+    "\n",
+    "**YoutubeDNN recall architecture**\n",
+ "\n",
+ "![image-20201111160516562](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201111160516562.png)\n",
+ "\n",
+ "\n",
+ "\n",
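+    "For the principles and applications of YoutubeDNN, two blog posts by Wang Zhe are recommended:\n",
+    "\n",
+    "1. [Re-reading the YouTube deep learning recommender system paper (in Chinese)](https://zhuanlan.zhihu.com/p/52169807)\n",
+    "2. [Ten engineering questions about the YouTube deep learning recommender system (in Chinese)](https://zhuanlan.zhihu.com/p/52504407)\n",
+    "\n",
+    "\n",
+    "**References:**\n",
+    "1. https://zhuanlan.zhihu.com/p/52169807 (principles of YouTubeDNN)\n",
+    "2. https://zhuanlan.zhihu.com/p/26306795 (a highly upvoted Zhihu article on Word2Vec); word2vec is introduced in the w2v part of the ranking section\n"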
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:13:11.058766Z",
+ "start_time": "2020-11-16T10:13:11.041084Z"
+ }
+ },
+ "outputs": [],
+ "source": [
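+    "# Build the training and validation data for two-tower recall\n",
+    "# negsample: number of negative samples per positive when building samples with a sliding window\n",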
+ "def gen_data_set(data, negsample=0):\n",
+ " data.sort_values(\"click_timestamp\", inplace=True)\n",
+ " item_ids = data['click_article_id'].unique()\n",
+ "\n",
+ " train_set = []\n",
+ " test_set = []\n",
+ " for reviewerID, hist in tqdm(data.groupby('user_id')):\n",
+ " pos_list = hist['click_article_id'].tolist()\n",
+ " \n",
+ " if negsample > 0:\n",
+    "            candidate_set = list(set(item_ids) - set(pos_list))   # sample negatives from articles the user has not clicked\n",
+    "            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  # draw negsample negatives for each positive\n",
+ " \n",
+    "        # With a single click, still add the record to the training set; otherwise the learned embeddings would have gaps\n",
+ " if len(pos_list) == 1:\n",
+ " train_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
+ " test_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
+ " \n",
+    "        # Build positive and negative samples with a sliding window\n",
+ " for i in range(1, len(pos_list)):\n",
+ " hist = pos_list[:i]\n",
+ " \n",
+ " if i != len(pos_list) - 1:\n",
+    "                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  # positive sample [user_id, his_item, pos_item, label, len(his_item)]\n",
+    "                for negi in range(negsample):\n",
+    "                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0,len(hist[::-1]))) # negative sample [user_id, his_item, neg_item, label, len(his_item)]\n",
+    "            else:\n",
+    "                # The longest history, which predicts the last click, is used as test data\n",
+ " test_set.append((reviewerID, hist[::-1], pos_list[i],1,len(hist[::-1])))\n",
+ " \n",
+ " random.shuffle(train_set)\n",
+ " random.shuffle(test_set)\n",
+ " \n",
+ " return train_set, test_set\n",
+ "\n",
+    "# Pad the inputs so that all sequence features have the same length\n",
+ "def gen_model_input(train_set,user_profile,seq_max_len):\n",
+ "\n",
+ " train_uid = np.array([line[0] for line in train_set])\n",
+ " train_seq = [line[1] for line in train_set]\n",
+ " train_iid = np.array([line[2] for line in train_set])\n",
+ " train_label = np.array([line[3] for line in train_set])\n",
+ " train_hist_len = np.array([line[4] for line in train_set])\n",
+ "\n",
+ " train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)\n",
+ " train_model_input = {\"user_id\": train_uid, \"click_article_id\": train_iid, \"hist_article_id\": train_seq_pad,\n",
+ " \"hist_len\": train_hist_len}\n",
+ "\n",
+ " return train_model_input, train_label"
+ ]
+ },
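+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To make the sliding-window construction in `gen_data_set` concrete, here is a minimal sketch on a toy click list, with negative sampling and the single-click case left out. The helper `sliding_window_samples` and the toy ids are illustrative only and are not used elsewhere in the notebook.\n",
+    "\n",
+    "```python\n",
+    "def sliding_window_samples(user_id, pos_list):\n",
+    "    # Hypothetical helper mirroring the positive-sample logic of gen_data_set\n",
+    "    train, test = [], []\n",
+    "    for i in range(1, len(pos_list)):\n",
+    "        hist = pos_list[:i][::-1]  # most recent click first\n",
+    "        sample = (user_id, hist, pos_list[i], 1, len(hist))\n",
+    "        if i != len(pos_list) - 1:\n",
+    "            train.append(sample)\n",
+    "        else:\n",
+    "            test.append(sample)  # the last click is held out for testing\n",
+    "    return train, test\n",
+    "\n",
+    "train, test = sliding_window_samples('u1', [10, 20, 30, 40])\n",
+    "print(train)  # [('u1', [10], 20, 1, 1), ('u1', [20, 10], 30, 1, 2)]\n",
+    "print(test)   # [('u1', [30, 20, 10], 40, 1, 3)]\n",
+    "```\n",
+    "\n",
+    "A user with four clicks therefore contributes two training samples and one test sample."
+   ]
+  },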
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:13:18.124452Z",
+ "start_time": "2020-11-16T10:13:18.098284Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def youtubednn_u2i_dict(data, topk=20): \n",
+ " sparse_features = [\"click_article_id\", \"user_id\"]\n",
+    "    SEQ_LEN = 30 # length of the user click sequence: shorter ones are padded, longer ones truncated\n",
+ " \n",
+ " user_profile_ = data[[\"user_id\"]].drop_duplicates('user_id')\n",
+ " item_profile_ = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
+ " \n",
+    "    # Label-encode the categorical features\n",
+ " features = [\"click_article_id\", \"user_id\"]\n",
+ " feature_max_idx = {}\n",
+ " \n",
+ " for feature in features:\n",
+ " lbe = LabelEncoder()\n",
+ " data[feature] = lbe.fit_transform(data[feature])\n",
+ " feature_max_idx[feature] = data[feature].max() + 1\n",
+ " \n",
+    "    # Extract user and item profiles; which features to include needs further analysis\n",
+ " user_profile = data[[\"user_id\"]].drop_duplicates('user_id')\n",
+ " item_profile = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
+ " \n",
+ " user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))\n",
+ " item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))\n",
+ " \n",
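+    "    # Split into training and test sets\n",
+    "    # Deep learning usually needs a lot of data, so a sliding window is often used to enlarge the training set and keep recall quality up\n",
+    "    train_set, test_set = gen_data_set(data, 0)\n",
+    "    # Assemble the model inputs; see the functions above for details\n",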
+ " train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)\n",
+ " test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)\n",
+ " \n",
+    "    # Set the embedding dimension\n",
+ " embedding_dim = 16\n",
+ " \n",
+    "    # Arrange the data into feature columns the model can consume directly\n",
+ " user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),\n",
+ " VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,\n",
+ " embedding_name=\"click_article_id\"), SEQ_LEN, 'mean', 'hist_len'),]\n",
+ " item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]\n",
+ " \n",
+    "    # Model definition \n",
+    "    # num_sampled: number of samples drawn during negative sampling\n",
+ " model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))\n",
+    "    # Compile the model\n",
+ " model.compile(optimizer=\"adam\", loss=sampledsoftmaxloss) \n",
+ " \n",
+    "    # Train the model; validation_split sets the validation fraction, and 0 trains on all of the data\n",
+ " history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)\n",
+ " \n",
+    "    # After training, extract the learned embeddings on both the user side and the item side\n",
+ " test_user_model_input = test_model_input\n",
+ " all_item_model_input = {\"click_article_id\": item_profile['click_article_id'].values}\n",
+ "\n",
+ " user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)\n",
+ " item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)\n",
+ " \n",
+    "    # Save the current item_embedding and user_embedding, which may be useful for ranking; they must stay aligned with the raw ids\n",
+ " user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)\n",
+ " item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)\n",
+ " \n",
+    "    # Normalize the embeddings before saving\n",
+ " user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)\n",
+ " item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)\n",
+ " \n",
+    "    # Convert the embeddings to dicts for easy lookup\n",
+    "    # Rows of user_embs follow the (shuffled) test-set user order, so pair them with those ids\n",
+    "    raw_user_id_emb_dict = {user_index_2_rawid[k]: \\\n",
+    "                                v for k, v in zip(test_user_model_input['user_id'], user_embs)}\n",
+    "    raw_item_id_emb_dict = {item_index_2_rawid[k]: \\\n",
+    "                                v for k, v in zip(item_profile['click_article_id'], item_embs)}\n",
+    "    # Save the embeddings to disk\n",
+ " pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))\n",
+ " pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))\n",
+ " \n",
+    "    # Faiss nearest-neighbor search: use each user_embedding to find the topk most similar items\n",
+    "    index = faiss.IndexFlatIP(embedding_dim)\n",
+    "    # Already normalized above, so no need to normalize again here\n",
+    "#     faiss.normalize_L2(user_embs)\n",
+    "#     faiss.normalize_L2(item_embs)\n",
+    "    index.add(item_embs) # build the index over the item vectors\n",
+    "    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # query the topk most similar items for each user\n",
+ " \n",
+ " user_recall_items_dict = collections.defaultdict(dict)\n",
+ " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):\n",
+ " target_raw_id = user_index_2_rawid[target_idx]\n",
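+    "        # Queries are users and indexed vectors are items, so the top hit is a real candidate and is kept\n",
+    "        for rele_idx, sim_value in zip(rele_idx_list, sim_value_list): \n",
+    "            # rele_idx is a row position in item_embs; map it to the encoded id, then to the raw id\n",
+    "            rele_raw_id = item_index_2_rawid[item_profile['click_article_id'].values[rele_idx]]\n",
+    "            user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\\\n",
+    "                                                                    .get(rele_raw_id, 0) + sim_value\n",
+    "            \n",
+    "    # Sort each user's recalled items by similarity\n",
+    "    user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}\n",
+    "    \n",
+    "    # Save the recall results\n",
+    "    # Here the recall results come straight from vector search. The methods above only produce i2i and u2u similarity matrices, which still need a collaborative filtering step to yield recall results\n",
+    "    # These results can be evaluated directly; for convenience, one evaluation function can be written for all recall methods\n",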
+ " pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl', 'wb'))\n",
+ " return user_recall_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:21:46.420014Z",
+ "start_time": "2020-11-16T10:13:35.351131Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [02:02<00:00, 2038.57it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "keep_dims is deprecated, use keepdims instead\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "1149673/1149673 [==============================] - 216s 188us/sample - loss: 0.1326\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "250000it [00:32, 7720.75it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Recall evaluation is needed here, so each user's last click in the training set is held out\n",
+ "if not metric_recall:\n",
+ " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)\n",
+ "else:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)\n",
+    "    # Evaluate recall quality\n",
+ " metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
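+    "### itemcf recall\n",
+    "\n",
+    "The article similarity matrices have been obtained above through collaborative filtering and embedding search. Below, following the idea of collaborative filtering, we recall for each user the articles similar to those in their history.\n",
+    "Recall here also applies some association-style rules:\n",
+    "1. Weight by the position of the historically clicked article in the click sequence (see the code for details)\n",
+    "2. Weight by article creation time, that is, by the creation-time gap between the similar article and the historically clicked article\n",
+    "3. Weight by article content similarity (computed from embeddings; note that the embedding step did not compute similarity for every pair of articles, so a similar article may have no embedding similarity with a historically clicked article, which needs special handling)"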
+ ]
+ },
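+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The position and creation-time weights listed above can be illustrated in isolation. This is a minimal sketch that reuses the constants 0.9 and 0.8 from `item_based_recommend`; the two helper functions are hypothetical, and the creation times are assumed to be scaled to the range [0, 1].\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "def loc_weight(hist_len, loc):\n",
+    "    # Later positions in the click history (larger loc) get a weight closer to 1\n",
+    "    return 0.9 ** (hist_len - loc)\n",
+    "\n",
+    "def created_time_weight(t_i, t_j):\n",
+    "    # Smaller creation-time gaps give a larger weight (at most e for a zero gap)\n",
+    "    return np.exp(0.8 ** np.abs(t_i - t_j))\n",
+    "\n",
+    "print([round(loc_weight(3, loc), 3) for loc in range(3)])  # [0.729, 0.81, 0.9]\n",
+    "print(created_time_weight(0.5, 0.5), created_time_weight(0.0, 1.0))\n",
+    "```\n",
+    "\n",
+    "Both weights shrink as the historical click gets older or the creation times drift apart, so recent clicks and contemporaneous articles contribute more to the final score."
+   ]
+  },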
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:48:40.580553Z",
+ "start_time": "2020-11-16T11:48:40.567130Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Item-based recall (i2i)\n",
+    "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
+    "    \"\"\"\n",
+    "        Recall based on article collaborative filtering\n",
+    "        :param user_id: user id\n",
+    "        :param user_item_time_dict: dict, each user's click sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
+    "        :param i2i_sim: dict, article similarity matrix\n",
+    "        :param sim_item_topk: int, number of most similar articles to take for each clicked article\n",
+    "        :param recall_item_num: int, number of articles to recall in the end\n",
+    "        :param item_topk_click: list, most-clicked articles, used to pad a user's recall list\n",
+    "        :param item_created_time_dict: dict, article creation times\n",
+    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
+    "        \n",
+    "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",
+    "    \"\"\"\n",
+    "    # Articles the user has interacted with\n",
+    "    user_hist_items = user_item_time_dict[user_id]\n",
+    "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
+ " \n",
+ " item_rank = {}\n",
+ " for loc, (i, click_time) in enumerate(user_hist_items):\n",
+ " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
+ " if j in user_hist_items_:\n",
+ " continue\n",
+ " \n",
+    "            # Weight for the creation-time gap between the two articles\n",
+ " created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
+    "            # Position weight of the historical article within the user's click sequence\n",
+ " loc_weight = (0.9 ** (len(user_hist_items) - loc))\n",
+ " \n",
+ " content_weight = 1.0\n",
+ " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
+ " content_weight += emb_i2i_sim[i][j]\n",
+ " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
+ " content_weight += emb_i2i_sim[j][i]\n",
+ " \n",
+ " item_rank.setdefault(j, 0)\n",
+ " item_rank[j] += created_time_weight * loc_weight * content_weight * wij\n",
+ " \n",
+    "    # Fewer than recall_item_num items: pad with popular articles\n",
+    "    if len(item_rank) < recall_item_num:\n",
+    "        for i, item in enumerate(item_topk_click):\n",
+    "            if item in item_rank: # a padded item must not already be in the list\n",
+    "                continue\n",
+    "            item_rank[item] = - i - 100 # any negative score will do\n",
+ " if len(item_rank) == recall_item_num:\n",
+ " break\n",
+ " \n",
+ " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
+ " \n",
+ " return item_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### itemcf sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T14:41:23.433038Z",
+ "start_time": "2020-11-16T11:48:46.286350Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [2:51:13<00:00, 24.33it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+    "# Run itemcf recall first; the last click is held out for recall evaluation\n",
+ "\n",
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "\n",
+ "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
+ "emb_i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_item_topk = 20\n",
+ "recall_item_num = 10\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, \\\n",
+ " i2i_sim, sim_item_topk, recall_item_num, \\\n",
+ " item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ "\n",
+ "user_multi_recall_dict['itemcf_sim_itemcf_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['itemcf_sim_itemcf_recall'], open(save_path + 'itemcf_recall_dict.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate recall quality\n",
+ " metrics_recall(user_multi_recall_dict['itemcf_sim_itemcf_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### embedding sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T15:04:51.527795Z",
+ "start_time": "2020-11-16T14:59:03.907519Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [04:35<00:00, 905.85it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+    "# The last click is held out for recall evaluation\n",
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
+ "\n",
+ "sim_item_topk = 20\n",
+ "recall_item_num = 10\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
+ " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ " \n",
+ "user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'], open(save_path + 'embedding_sim_item_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate recall quality\n",
+ " metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
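+    "### usercf recall\n",
+    "\n",
+    "User-based collaborative filtering: the core idea is to recommend to a user the articles historically clicked by similar users. Because similar users' click histories are involved, we can again apply some association-style rules to weight the articles the user may click. The rules used here mainly weight the relationship between a similar user's historically clicked articles and the target user's historically clicked articles. This relationship borrows directly from the item-based collaborative filtering approach, except that here the weights are accumulated over the recommended article's relationships. The relationship weights used, and the corresponding code, are below:\n",
+    "\n",
+    "1. Between the target user's historically clicked articles and the similar user's historically clicked articles, compute the sums of similarity, creation-time gap, and relative position, and use each sum as its own weight"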
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:32.293990Z",
+ "start_time": "2020-11-17T02:09:32.278678Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# User-based recall (u2u2i)\n",
+    "def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num, \n",
+    "                         item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
+    "    \"\"\"\n",
+    "        Recall based on user collaborative filtering\n",
+    "        :param user_id: user id\n",
+    "        :param user_item_time_dict: dict, each user's click sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
+    "        :param u2u_sim: dict, user similarity matrix\n",
+    "        :param sim_user_topk: int, number of most similar users to take for the current user\n",
+    "        :param recall_item_num: int, number of articles to recall in the end\n",
+    "        :param item_topk_click: list, most-clicked articles, used to pad a user's recall list\n",
+    "        :param item_created_time_dict: dict, article creation times\n",
+    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
+    "        \n",
+    "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",
+    "    \"\"\"\n",
+    "    # Interaction history\n",
+    "    user_item_time_list = user_item_time_dict[user_id]    #  [(item1, time1), (item2, time2)..]\n",
+    "    user_hist_items = set([i for i, t in user_item_time_list])   # a user may interact with the same article several times, so deduplicate\n",
+ " \n",
+ " items_rank = {}\n",
+ " for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:\n",
+ " for i, click_time in user_item_time_dict[sim_u]:\n",
+ " if i in user_hist_items:\n",
+ " continue\n",
+ " items_rank.setdefault(i, 0)\n",
+ " \n",
+ " loc_weight = 1.0\n",
+ " content_weight = 1.0\n",
+ " created_time_weight = 1.0\n",
+ " \n",
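+    "            # Accumulate weights between the candidate article and each article in the target user's history\n",
+    "            for loc, (j, click_time) in enumerate(user_item_time_list):\n",
+    "                # Relative position weight at click time\n",
+    "                loc_weight += 0.9 ** (len(user_item_time_list) - loc)\n",
+    "                # Content similarity weight\n",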
+ " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
+ " content_weight += emb_i2i_sim[i][j]\n",
+ " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
+ " content_weight += emb_i2i_sim[j][i]\n",
+ " \n",
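+    "                # Creation-time gap weight; uses 0.8 ** gap as in item_based_recommend so that a larger gap lowers the weight\n",
+    "                created_time_weight += np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",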
+ " \n",
+ " items_rank[i] += loc_weight * content_weight * created_time_weight * wuv\n",
+ " \n",
+    "    # Pad with popular articles\n",
+    "    if len(items_rank) < recall_item_num:\n",
+    "        for i, item in enumerate(item_topk_click):\n",
+    "            if item in items_rank: # a padded item must not already be in the list\n",
+    "                continue\n",
+    "            items_rank[item] = - i - 100 # any negative score will do\n",
+ " if len(items_rank) == recall_item_num:\n",
+ " break\n",
+ " \n",
+ " items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num] \n",
+ " \n",
+ " return items_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### usercf sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:05:41.652501Z",
+ "start_time": "2020-11-16T07:05:40.953871Z"
+ }
+ },
+ "outputs": [],
+ "source": [
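+    "# The last click is held out for recall evaluation\n",
+    "# Computing user-user similarity for usercf uses too much memory, so it was not run on the full data, only on a sampled subset\n",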
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ " \n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "\n",
+ "u2u_sim = pickle.load(open(save_path + 'usercf_u2u_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_user_topk = 20\n",
+ "recall_item_num = 10\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
+ " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim) \n",
+ "\n",
+ "pickle.dump(user_recall_items_dict, open(save_path + 'usercf_u2u2i_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate recall quality\n",
+ " metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
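+    "#### user embedding sim recall\n",
+    "\n",
+    "Although the usercf user-similarity computation was not run directly, the user-based collaborative filtering code above can still be validated. Below, the user embeddings produced by YoutubeDNN are used to retrieve each user's topk most similar users by vector search, and the resulting u2u similarity matrix is then used for usercf recall. The code is as follows:"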
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:46.807811Z",
+ "start_time": "2020-11-17T02:09:46.798033Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Build the u2u similarity matrix from embeddings\n",
+    "# topk: for each user, Faiss returns the topk most similar users\n",
+ "def u2u_embdding_sim(click_df, user_emb_dict, save_path, topk):\n",
+ " \n",
+ " user_list = []\n",
+ " user_emb_list = []\n",
+ " for user_id, user_emb in user_emb_dict.items():\n",
+ " user_list.append(user_id)\n",
+ " user_emb_list.append(user_emb)\n",
+ " \n",
+ " user_index_2_rawid_dict = {k: v for k, v in zip(range(len(user_list)), user_list)} \n",
+ " \n",
+ " user_emb_np = np.array(user_emb_list, dtype=np.float32)\n",
+ " \n",
+    "    # Build the Faiss index\n",
+ " user_index = faiss.IndexFlatIP(user_emb_np.shape[1])\n",
+ " user_index.add(user_emb_np)\n",
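+    "    # Similarity query: for each indexed vector, return its topk users and their similarities\n",
+    "    sim, idx = user_index.search(user_emb_np, topk) # both results are arrays of shape (n_users, topk)\n",
+    "   \n",
+    "    # Store the search results keyed by raw user id\n",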
+ " user_sim_dict = collections.defaultdict(dict)\n",
+ " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(user_emb_np)), sim, idx)):\n",
+ " target_raw_id = user_index_2_rawid_dict[target_idx]\n",
+    "        # Start from 1 to skip the user itself, so only topk-1 similar users are kept\n",
+ " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
+ " rele_raw_id = user_index_2_rawid_dict[rele_idx]\n",
+ " user_sim_dict[target_raw_id][rele_raw_id] = user_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
+ " \n",
+    "    # Save the u2u similarity matrix\n",
+ " pickle.dump(user_sim_dict, open(save_path + 'youtube_u2u_sim.pkl', 'wb')) \n",
+ " return user_sim_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:14:31.355905Z",
+ "start_time": "2020-11-17T02:09:53.236531Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "250000it [00:23, 10507.45it/s]\n"
+ ]
+ }
+ ],
+ "source": [
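+    "# Load the user embeddings produced by YoutubeDNN, then compute user-user similarity with Faiss\n",
+    "# Note that these user embeddings are not very good, because YoutubeDNN trains them from user click sequences,\n",
+    "# and they perform poorly when the sequences are generally short\n",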
+ "user_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
+ "u2u_sim = u2u_embdding_sim(all_click_df, user_emb_dict, save_path, topk=10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Recall using the user_embedding obtained from YoutubeDNN"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:49:40.755431Z",
+ "start_time": "2020-11-17T02:28:47.003514Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [19:43<00:00, 211.22it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Use the recall evaluation function to check how well this recall method works\n",
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "u2u_sim = pickle.load(open(save_path + 'youtube_u2u_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_user_topk = 20\n",
+ "recall_item_num = 10\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
+ " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ " \n",
+ "user_multi_recall_dict['youtubednn_usercf_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['youtubednn_usercf_recall'], open(save_path + 'youtubednn_usercf_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate recall quality\n",
+ " metrics_recall(user_multi_recall_dict['youtubednn_usercf_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## The cold-start problem"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
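+    "**Cold-start problems fall into three types: article cold start, user cold start, and system cold start.**\n",
+    "\n",
+    "- Article cold start: how to recommend an article newly added to the platform, which has no interaction records at all. (In our setting, any article that never appears in the log data can be treated as a cold-start article.)\n",
+    "- User cold start: how to recommend for a user who is new to the platform and has no article interactions yet. (In our setting, this means checking whether a test-set user appears in the log data that accompanies the test set; if not, the user can be treated as a cold-start user. The criterion need not be this strict: we can also define our own indicators for identifying cold-start users, such as usage time, click-through rate, or retention rate.)\n",
+    "- System cold start: a platform that has just launched and has no historical data at all. It is essentially a combination of the two cases above.\n",
+    "\n",
+    "**Cold-start analysis for the current scenario:**\n",
+    "\n",
+    "Analyzing the current data shows that only a little over 30,000 distinct articles appear in the click logs, while the full article library holds more than 300,000. Could a test-set user's last click land on an article that never appears in the logs? If so, the clicked article has no prior interaction information, which is exactly the article cold start described above. The analysis also shows that a sizable share of test-set users have only a single click. Recommending articles with a model from one click alone is difficult, so user cold start is also worth considering. This section only gives solutions and code for item cold start, and mentions some feasible approaches for user cold start.\n",
+    "\n",
+    "1. Article cold start (without the cold-start exploration problem)   \n",
+    "    We are not doing article cold start for its own sake. Rather, we guess that users may click some articles that do not appear in the log data, and the task is to choose, from nearly 270,000 articles, some to serve as cold-start articles for each user. This can also be viewed as a recall strategy. Here we use a simple, easy-to-understand rule-based recall to obtain articles the user may click that do not appear in the log data.\n",
+    "    The question becomes: how do we pick a small subset of those 270,000 articles for each user? Random selection is one option. Some reference approaches follow.\n",
+    "     1. First, recall a set of articles similar to the user's history based on embeddings\n",
+    "     2. Then filter the embedding-recalled articles with some rules so that the remaining ones are more likely to be clicked. The rules can be: keep articles whose topic matches the user's historically clicked articles, or whose word count is close to theirs. Also prefer articles created close to the test-set user's last click time, or on the same day.\n",
+    "2. User cold start  \n",
+    "    Analyzing the test-set click data shows that 20 percent of test-set users have only one click. Could recall for these low-click users get dedicated supplementary strategies? Or could some articles be added by rule directly after ranking? Both are worth trying; no concrete implementation is provided here.\n",
+    "    \n",
+    "**Note:**   \n",
+    "\n",
+    "This looks the same as running itemcf on embedding-based item similarity, but the goal differs: we want articles with similar vectors that have not yet appeared in the logs, combined with some other cold-start strategies. The number of articles recalled here should be somewhat larger, otherwise nothing may be left after filtering"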
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T04:30:23.027164Z",
+ "start_time": "2020-11-17T04:23:09.960235Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [05:01<00:00, 828.60it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+    "# Run itemcf recall first; no recall evaluation is needed here, since this is only a strategy\n",
+ "trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
+ "\n",
+ "sim_item_topk = 150\n",
+ "recall_item_num = 100 # recall a few more articles so the later rule-based filtering has enough to work with\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
+ " recall_item_num, item_topk_click,item_created_time_dict, emb_i2i_sim)\n",
+ "pickle.dump(user_recall_items_dict, open(save_path + 'cold_start_items_raw_dict.pkl', 'wb'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:11:39.267581Z",
+ "start_time": "2020-11-17T06:11:39.252563Z"
+ }
+ },
+ "outputs": [],
+ "source": [
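+ "# Filter articles with rules\n",
+ "# Keep articles whose topic matches the topics the user browsed before\n",
+ "# Keep articles whose word count is close to that of the user's browsing history\n",
+ "# Keep articles created close to the user's last click (the code below uses a 90-day window)\n",
+ "# Return the final results ordered by similarity\n",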
+ "\n",
+ "def get_click_article_ids_set(all_click_df):\n",
+ " return set(all_click_df.click_article_id.values)\n",
+ "\n",
+ "def cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
+ " user_last_item_created_time_dict, item_type_dict, item_words_dict, \n",
+ " item_created_time_dict, click_article_ids_set, recall_item_num):\n",
+ " \"\"\"\n",
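+ "    Recall some articles for the cold-start case\n",
+ "    :param user_recall_items_dict: dict of the many articles recalled by content-embedding similarity, {user1: [(item1, score1), ...], ...}\n",
+ "    :param user_hist_item_typs_dict: dict, user -> topics of the articles the user clicked\n",
+ "    :param user_hist_item_words_dict: dict, user -> word-count statistic of the articles the user clicked (used below as the historical mean)\n",
+ "    :param user_last_item_created_time_dict: dict, user -> creation time of the last article the user clicked\n",
+ "    :param item_type_dict: dict, article -> topic\n",
+ "    :param item_words_dict: dict, article -> word count\n",
+ "    :param item_created_time_dict: dict, article -> creation time\n",
+ "    :param click_article_ids_set: set of articles that have been clicked, i.e. the articles that appear in the logs\n",
+ "    :param recall_item_num: number of articles to recall, counting only articles that do not appear in the logs\n",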
+ " \"\"\"\n",
+ " \n",
+ " cold_start_user_items_dict = {}\n",
+ " for user, item_list in tqdm(user_recall_items_dict.items()):\n",
+ " cold_start_user_items_dict.setdefault(user, [])\n",
+ " for item, score in item_list:\n",
+ "            # Look up the user's click-history information\n",
+ " hist_item_type_set = user_hist_item_typs_dict[user]\n",
+ " hist_mean_words = user_hist_item_words_dict[user]\n",
+ " hist_last_item_created_time = user_last_item_created_time_dict[user]\n",
+ " hist_last_item_created_time = datetime.fromtimestamp(hist_last_item_created_time)\n",
+ " \n",
+ "            # Look up the information of the currently recalled article\n",
+ " curr_item_type = item_type_dict[item]\n",
+ " curr_item_words = item_words_dict[item]\n",
+ " curr_item_created_time = item_created_time_dict[item]\n",
+ " curr_item_created_time = datetime.fromtimestamp(curr_item_created_time)\n",
+ "\n",
+ "            # The article must not appear anywhere in the click logs; then filter by topic, word count, and creation time\n",
+ " if curr_item_type not in hist_item_type_set or \\\n",
+ " item in click_article_ids_set or \\\n",
+ " abs(curr_item_words - hist_mean_words) > 200 or \\\n",
+ " abs((curr_item_created_time - hist_last_item_created_time).days) > 90: \n",
+ " continue\n",
+ " \n",
+ " cold_start_user_items_dict[user].append((item, score)) # {user1: [(item1, score1), (item2, score2)..]...}\n",
+ " \n",
+ "    # Cap the number of cold-start articles recalled per user\n",
+ " cold_start_user_items_dict = {k: sorted(v, key=lambda x:x[1], reverse=True)[:recall_item_num] \\\n",
+ " for k, v in cold_start_user_items_dict.items()}\n",
+ " \n",
+ " pickle.dump(cold_start_user_items_dict, open(save_path + 'cold_start_user_items_dict.pkl', 'wb'))\n",
+ " \n",
+ " return cold_start_user_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:35:38.758278Z",
+ "start_time": "2020-11-17T06:31:40.164332Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [01:49<00:00, 2293.37it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "all_click_df_ = all_click_df.copy()\n",
+ "all_click_df_ = all_click_df_.merge(item_info_df, how='left', on='click_article_id')\n",
+ "user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict = get_user_hist_item_info_dict(all_click_df_)\n",
+ "click_article_ids_set = get_click_article_ids_set(all_click_df)\n",
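+ "# Note:\n",
+ "# Many rules are used here to filter cold-start articles, so the earlier recall stage should recall as many articles as possible; otherwise they are easily all filtered out\n",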
+ "cold_start_user_items_dict = cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
+ " user_last_item_created_time_dict, item_type_dict, item_words_dict, \\\n",
+ " item_created_time_dict, click_article_ids_set, recall_item_num)\n",
+ "\n",
+ "user_multi_recall_dict['cold_start_recall'] = cold_start_user_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:13:33.099298Z",
+ "start_time": "2020-11-16T07:13:32.655036Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
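+ "## Merging Multi-Channel Recall\n",
+ "Multi-channel recall merging combines the per-user article lists produced by all of the preceding recall strategies. The recall results so far are:\n",
+ "1. Recall using the item-to-item similarity computed by itemcf  \n",
+ "2. Recall using the item-to-item similarity obtained by embedding search\n",
+ "3. YoutubeDNN recall\n",
+ "4. Recall using the user-to-user similarity obtained from YoutubeDNN\n",
+ "5. Recall based on the cold-start strategy\n",
+ "\n",
+ "**Note:**  \n",
+ "Recall evaluation shows that some channels perform well and others poorly, so we can manually assign a weight to each channel's results when fusing the final similarity scores."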
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:02:16.033971Z",
+ "start_time": "2020-11-17T07:02:16.019819Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):\n",
+ " final_recall_items_dict = {}\n",
+ " \n",
+ "    # Min-max normalize each recall channel's scores per user so that scores for the same user's items can be summed across channels\n",
+ " def norm_user_recall_items_sim(sorted_item_list):\n",
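+ "        # If the list is empty or has a single article, return it unchanged. This can happen when cold-start recall yields too few articles\n",
+ "        # and nothing is left after rule-based filtering; other strategy-based filtering could also be applied here\n",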
+ " if len(sorted_item_list) < 2:\n",
+ " return sorted_item_list\n",
+ " \n",
+ " min_sim = sorted_item_list[-1][1]\n",
+ " max_sim = sorted_item_list[0][1]\n",
+ " \n",
+ " norm_sorted_item_list = []\n",
+ " for item, score in sorted_item_list:\n",
+ " if max_sim > 0:\n",
+ " norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0\n",
+ " else:\n",
+ " norm_score = 0.0\n",
+ " norm_sorted_item_list.append((item, norm_score))\n",
+ " \n",
+ " return norm_sorted_item_list\n",
+ " \n",
+ " print('多路召回合并...')\n",
+ " for method, user_recall_items in tqdm(user_multi_recall_dict.items()):\n",
+ " print(method + '...')\n",
+ "        # A weight can also be set for each recall channel when computing the final result\n",
+ "        if weight_dict is None:\n",
+ " recall_method_weight = 1\n",
+ " else:\n",
+ " recall_method_weight = weight_dict[method]\n",
+ " \n",
+ "        for user_id, sorted_item_list in user_recall_items.items(): # normalize the scores\n",
+ " user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list)\n",
+ " \n",
+ " for user_id, sorted_item_list in user_recall_items.items():\n",
+ " # print('user_id')\n",
+ " final_recall_items_dict.setdefault(user_id, {})\n",
+ " for item, score in sorted_item_list:\n",
+ " final_recall_items_dict[user_id].setdefault(item, 0)\n",
+ " final_recall_items_dict[user_id][item] += recall_method_weight * score \n",
+ " \n",
+ " final_recall_items_dict_rank = {}\n",
+ "    # The final number of recalled items can also be capped when merging\n",
+ " for user, recall_item_dict in final_recall_items_dict.items():\n",
+ " final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk]\n",
+ "\n",
+ "    # Save the merged recall result dict to disk\n",
+ " pickle.dump(final_recall_items_dict_rank, open(os.path.join(save_path, 'final_recall_items_dict.pkl'),'wb'))\n",
+ "\n",
+ " return final_recall_items_dict_rank"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:02:21.078455Z",
+ "start_time": "2020-11-17T07:02:21.074060Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# All recall channels are given the same weight here; the values can be tuned based on how each channel performed earlier\n",
+ "weight_dict = {'itemcf_sim_itemcf_recall': 1.0,\n",
+ " 'embedding_sim_item_recall': 1.0,\n",
+ " 'youtubednn_recall': 1.0,\n",
+ " 'youtubednn_usercf_recall': 1.0, \n",
+ " 'cold_start_recall': 1.0}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:04:35.747924Z",
+ "start_time": "2020-11-17T07:02:26.889573Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 0%| | 0/5 [00:00, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "多路召回合并...\n",
+ "itemcf_sim_itemcf_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 20%|██ | 1/5 [00:08<00:34, 8.66s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "embedding_sim_item_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 40%|████ | 2/5 [00:16<00:24, 8.29s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "youtubednn_recall...\n",
+ "youtubednn_usercf_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 80%|████████ | 4/5 [00:23<00:06, 6.98s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cold_start_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 5/5 [00:42<00:00, 8.40s/it]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# After merging, keep 150 articles per user for ranking\n",
+ "final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=150)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
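+ "## Summary\n",
+ "\n",
+ "The following recall strategies were implemented above:\n",
+ "\n",
+ "1. itemcf based on association rules\n",
+ "2. usercf based on association rules\n",
+ "3. youtubednn recall\n",
+ "4. Cold-start recall\n",
+ "\n",
+ "None of these recall strategies is tuned to its best result; they are only a simple first attempt, and there is plenty of room for improvement, including adjusting the parameters of the existing strategies and adding or modifying association rules. More recall strategies can be tried as well, such as popularity-based recall for news.\n",
+ "\n",
+ "\n",
+ "\n",
+ "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI, bringing together outstanding learners from universities and well-known companies across many fields, and a group of team members with an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, and the courage to experiment and take responsibility. Datawhale also applies open-source ideas to explore open content, open learning, and open solutions, empowering talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic knowledge for this data-mining learning path will be shared on Tianchi; follow Datawhale for details:\n",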
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- ],
- "description": "",
- "notebookId": "130009",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "595px",
- "left": "61px",
- "top": "67px",
- "width": "174px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.5"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "nbTranslate": {
+ "displayLangs": [
+ "*"
+ ],
+ "hotkey": "alt-t",
+ "langInMainMenu": true,
+ "sourceLang": "en",
+ "targetLang": "fr",
+ "useGoogleTranslate": true
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [
+ {
+ "id": "83580",
+ "title": "零基础入门推荐系统 - 新闻推荐"
+ }
+ ],
+ "description": "",
+ "notebookId": "130009",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "595px",
+ "left": "61px",
+ "top": "67px",
+ "width": "174px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/ch03/ch3.1/jupyter/\346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb" "b/docs/ch03/ch3.1/jupyter/\346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
index 5f96e246b..3af0aa71f 100644
--- "a/docs/ch03/ch3.1/jupyter/\346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
+++ "b/docs/ch03/ch3.1/jupyter/\346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
@@ -1,2689 +1,2689 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 排序模型\n",
- "通过召回的操作, 我们已经进行了问题规模的缩减, 对于每个用户, 选择出了N篇文章作为了候选集,并基于召回的候选集构建了与用户历史相关的特征,以及用户本身的属性特征,文章本省的属性特征,以及用户与文章之间的特征,下面就是使用机器学习模型来对构造好的特征进行学习,然后对测试集进行预测,得到测试集中的每个候选集用户点击的概率,返回点击概率最大的topk个文章,作为最终的结果。\n",
- "\n",
- "排序阶段选择了三个比较有代表性的排序模型,它们分别是:\n",
- "\n",
- "1. LGB的排序模型\n",
- "2. LGB的分类模型\n",
- "3. 深度学习的分类模型DIN\n",
- "\n",
- "得到了最终的排序模型输出的结果之后,还选择了两种比较经典的模型集成的方法:\n",
- "\n",
- "1. 输出结果加权融合\n",
- "2. Staking(将模型的输出结果再使用一个简单模型进行预测)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:39.770642Z",
- "start_time": "2020-11-18T04:20:38.500875Z"
- }
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import pandas as pd\n",
- "import pickle\n",
- "from tqdm import tqdm\n",
- "import gc, os\n",
- "import time\n",
- "from datetime import datetime\n",
- "import lightgbm as lgb\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取排序特征"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:41.843180Z",
- "start_time": "2020-11-18T04:20:41.837287Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'\n",
- "offline = False"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:53.358138Z",
- "start_time": "2020-11-18T04:20:44.232944Z"
- }
- },
- "outputs": [],
- "source": [
- "# 重新读取数据的时候,发现click_article_id是一个浮点数,所以将其转换成int类型\n",
- "trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
- "trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
- " val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
- "tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)\n",
- "\n",
- "# 做特征的时候为了方便,给测试集也打上了一个无效的标签,这里直接删掉就行\n",
- "del tst_user_item_feats_df['label']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 返回排序后的结果"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:01.809368Z",
- "start_time": "2020-11-18T04:21:01.799641Z"
- }
- },
- "outputs": [],
- "source": [
- "def submit(recall_df, topk=5, model_name=None):\n",
- " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
- " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 判断是不是每个用户都有5篇文章及以上\n",
- " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
- " assert tmp.min() >= topk\n",
- " \n",
- " del recall_df['pred_score']\n",
- " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
- " \n",
- " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
- " # 按照提交格式定义列名\n",
- " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
- " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
- " \n",
- " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
- " submit.to_csv(save_name, index=False, header=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:04.332198Z",
- "start_time": "2020-11-18T04:21:04.325020Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序结果归一化\n",
- "def norm_sim(sim_df, weight=0.0):\n",
- " # print(sim_df.head())\n",
- " min_sim = sim_df.min()\n",
- " max_sim = sim_df.max()\n",
- " if max_sim == min_sim:\n",
- " sim_df = sim_df.apply(lambda sim: 1.0)\n",
- " else:\n",
- " sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))\n",
- "\n",
- " sim_df = sim_df.apply(lambda sim: sim + weight) # plus one\n",
- " return sim_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## LGB排序模型"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:07.787698Z",
- "start_time": "2020-11-18T04:21:07.536514Z"
- }
- },
- "outputs": [],
- "source": [
- "# 防止中间出错之后重新读取数据\n",
- "trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()\n",
- " \n",
- "tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:10.839656Z",
- "start_time": "2020-11-18T04:21:10.833109Z"
- }
- },
- "outputs": [],
- "source": [
- "# 定义特征列\n",
- "lgb_cols = ['sim0', 'time_diff0', 'word_diff0','sim_max', 'sim_min', 'sim_sum', \n",
- " 'sim_mean', 'score','click_size', 'time_diff_mean', 'active_level',\n",
- " 'click_environment','click_deviceGroup', 'click_os', 'click_country', \n",
- " 'click_region','click_referrer_type', 'user_time_hob1', 'user_time_hob2',\n",
- " 'words_hbo', 'category_id', 'created_at_ts','words_count']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:14.126608Z",
- "start_time": "2020-11-18T04:21:13.493653Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型分组\n",
- "trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
- "g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
- " g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:16.136151Z",
- "start_time": "2020-11-18T04:21:16.124444Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型定义\n",
- "lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:22.965433Z",
- "start_time": "2020-11-18T04:21:17.799127Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型训练\n",
- "if offline:\n",
- " lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,\n",
- " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
- " eval_group= [g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
- "else:\n",
- " lgb_ranker.fit(trn_user_item_feats_df[lgb_cols], trn_user_item_feats_df['label'], group=g_train)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:28.616665Z",
- "start_time": "2020-11-18T04:21:24.672280Z"
- }
- },
- "outputs": [],
- "source": [
- "# 模型预测\n",
- "tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
- "\n",
- "# 将这里的排序结果保存一份,用户后面的模型融合\n",
- "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:40.253692Z",
- "start_time": "2020-11-18T04:21:30.546587Z"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
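+ "# Ranking Models\n",
+ "The recall stage has already reduced the scale of the problem: for each user, N articles were selected as the candidate set. On top of the recalled candidates we built features related to the user's history, the user's own attribute features, the article's own attribute features, and user-article interaction features. The next step is to train machine learning models on these features and predict on the test set, obtaining the click probability of each candidate for each test user and returning the topk articles with the highest click probability as the final result.\n",
+ "\n",
+ "Three representative ranking models are used in the ranking stage:\n",
+ "\n",
+ "1. An LGB ranking model\n",
+ "2. An LGB classification model\n",
+ "3. DIN, a deep learning classification model\n",
+ "\n",
+ "After the ranking models produce their outputs, two classic model ensembling methods are applied:\n",
+ "\n",
+ "1. Weighted fusion of the output scores\n",
+ "2. Stacking (feeding the models' outputs into a simple model that makes the final prediction)"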
+ ]
},
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_ranker')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:26.195838Z",
- "start_time": "2020-11-18T04:21:46.115002Z"
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:39.770642Z",
+ "start_time": "2020-11-18T04:20:38.500875Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import pickle\n",
+ "from tqdm import tqdm\n",
+ "import gc, os\n",
+ "import time\n",
+ "from datetime import datetime\n",
+ "import lightgbm as lgb\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[1]\tvalid_0's ndcg@1: 0.909975\tvalid_0's ndcg@2: 0.963068\tvalid_0's ndcg@3: 0.96533\tvalid_0's ndcg@4: 0.965729\tvalid_0's ndcg@5: 0.965864\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9143\tvalid_0's ndcg@2: 0.964711\tvalid_0's ndcg@3: 0.966961\tvalid_0's ndcg@4: 0.967338\tvalid_0's ndcg@5: 0.967483\n",
- "[3]\tvalid_0's ndcg@1: 0.9181\tvalid_0's ndcg@2: 0.966114\tvalid_0's ndcg@3: 0.968289\tvalid_0's ndcg@4: 0.968773\tvalid_0's ndcg@5: 0.96887\n",
- "[4]\tvalid_0's ndcg@1: 0.925575\tvalid_0's ndcg@2: 0.969093\tvalid_0's ndcg@3: 0.971193\tvalid_0's ndcg@4: 0.971603\tvalid_0's ndcg@5: 0.97169\n",
- "[5]\tvalid_0's ndcg@1: 0.9267\tvalid_0's ndcg@2: 0.969635\tvalid_0's ndcg@3: 0.97166\tvalid_0's ndcg@4: 0.972037\tvalid_0's ndcg@5: 0.972133\n",
- "[6]\tvalid_0's ndcg@1: 0.927\tvalid_0's ndcg@2: 0.969682\tvalid_0's ndcg@3: 0.971757\tvalid_0's ndcg@4: 0.972134\tvalid_0's ndcg@5: 0.972231\n",
- "[7]\tvalid_0's ndcg@1: 0.928825\tvalid_0's ndcg@2: 0.970451\tvalid_0's ndcg@3: 0.972476\tvalid_0's ndcg@4: 0.97282\tvalid_0's ndcg@5: 0.972927\n",
- "[8]\tvalid_0's ndcg@1: 0.930025\tvalid_0's ndcg@2: 0.970988\tvalid_0's ndcg@3: 0.972951\tvalid_0's ndcg@4: 0.973295\tvalid_0's ndcg@5: 0.973402\n",
- "[9]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971347\tvalid_0's ndcg@3: 0.973384\tvalid_0's ndcg@4: 0.973707\tvalid_0's ndcg@5: 0.973794\n",
- "[10]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.971385\tvalid_0's ndcg@3: 0.973372\tvalid_0's ndcg@4: 0.973717\tvalid_0's ndcg@5: 0.973794\n",
- "[11]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.971433\tvalid_0's ndcg@3: 0.973333\tvalid_0's ndcg@4: 0.973699\tvalid_0's ndcg@5: 0.973767\n",
- "[12]\tvalid_0's ndcg@1: 0.93145\tvalid_0's ndcg@2: 0.971656\tvalid_0's ndcg@3: 0.973493\tvalid_0's ndcg@4: 0.973881\tvalid_0's ndcg@5: 0.973949\n",
- "[13]\tvalid_0's ndcg@1: 0.932525\tvalid_0's ndcg@2: 0.971927\tvalid_0's ndcg@3: 0.973839\tvalid_0's ndcg@4: 0.974227\tvalid_0's ndcg@5: 0.974304\n",
- "[14]\tvalid_0's ndcg@1: 0.932575\tvalid_0's ndcg@2: 0.971898\tvalid_0's ndcg@3: 0.973823\tvalid_0's ndcg@4: 0.974243\tvalid_0's ndcg@5: 0.97432\n",
- "[15]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972239\tvalid_0's ndcg@3: 0.974189\tvalid_0's ndcg@4: 0.974587\tvalid_0's ndcg@5: 0.974665\n",
- "[16]\tvalid_0's ndcg@1: 0.933475\tvalid_0's ndcg@2: 0.972309\tvalid_0's ndcg@3: 0.974209\tvalid_0's ndcg@4: 0.974596\tvalid_0's ndcg@5: 0.974674\n",
- "[17]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972369\tvalid_0's ndcg@3: 0.974307\tvalid_0's ndcg@4: 0.974684\tvalid_0's ndcg@5: 0.974761\n",
- "[18]\tvalid_0's ndcg@1: 0.9339\tvalid_0's ndcg@2: 0.972497\tvalid_0's ndcg@3: 0.974372\tvalid_0's ndcg@4: 0.974749\tvalid_0's ndcg@5: 0.974836\n",
- "[19]\tvalid_0's ndcg@1: 0.9345\tvalid_0's ndcg@2: 0.972845\tvalid_0's ndcg@3: 0.974645\tvalid_0's ndcg@4: 0.974979\tvalid_0's ndcg@5: 0.975085\n",
- "[20]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.97484\tvalid_0's ndcg@4: 0.975174\tvalid_0's ndcg@5: 0.975271\n",
- "[21]\tvalid_0's ndcg@1: 0.935\tvalid_0's ndcg@2: 0.973092\tvalid_0's ndcg@3: 0.97488\tvalid_0's ndcg@4: 0.975192\tvalid_0's ndcg@5: 0.975289\n",
- "[22]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.974988\tvalid_0's ndcg@4: 0.975289\tvalid_0's ndcg@5: 0.975386\n",
- "[23]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974824\tvalid_0's ndcg@4: 0.975136\tvalid_0's ndcg@5: 0.975223\n",
- "[24]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973274\tvalid_0's ndcg@3: 0.975087\tvalid_0's ndcg@4: 0.975388\tvalid_0's ndcg@5: 0.975475\n",
- "[25]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973345\tvalid_0's ndcg@3: 0.97512\tvalid_0's ndcg@4: 0.975443\tvalid_0's ndcg@5: 0.97553\n",
- "[26]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.975\tvalid_0's ndcg@4: 0.975313\tvalid_0's ndcg@5: 0.9754\n",
- "[27]\tvalid_0's ndcg@1: 0.935175\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.974983\tvalid_0's ndcg@4: 0.975295\tvalid_0's ndcg@5: 0.975382\n",
- "[28]\tvalid_0's ndcg@1: 0.935425\tvalid_0's ndcg@2: 0.973328\tvalid_0's ndcg@3: 0.975041\tvalid_0's ndcg@4: 0.975374\tvalid_0's ndcg@5: 0.975471\n",
- "[29]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973225\tvalid_0's ndcg@3: 0.974963\tvalid_0's ndcg@4: 0.975297\tvalid_0's ndcg@5: 0.975403\n",
- "[30]\tvalid_0's ndcg@1: 0.9353\tvalid_0's ndcg@2: 0.973235\tvalid_0's ndcg@3: 0.97501\tvalid_0's ndcg@4: 0.975311\tvalid_0's ndcg@5: 0.975418\n",
- "[31]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973361\tvalid_0's ndcg@3: 0.975099\tvalid_0's ndcg@4: 0.975422\tvalid_0's ndcg@5: 0.975528\n",
- "[32]\tvalid_0's ndcg@1: 0.9364\tvalid_0's ndcg@2: 0.973641\tvalid_0's ndcg@3: 0.975391\tvalid_0's ndcg@4: 0.975714\tvalid_0's ndcg@5: 0.97582\n",
- "[33]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973751\tvalid_0's ndcg@3: 0.975501\tvalid_0's ndcg@4: 0.975824\tvalid_0's ndcg@5: 0.975931\n",
- "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.973902\tvalid_0's ndcg@3: 0.975677\tvalid_0's ndcg@4: 0.975989\tvalid_0's ndcg@5: 0.976095\n",
- "[35]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974105\tvalid_0's ndcg@3: 0.975892\tvalid_0's ndcg@4: 0.976194\tvalid_0's ndcg@5: 0.9763\n",
- "[36]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974184\tvalid_0's ndcg@3: 0.975984\tvalid_0's ndcg@4: 0.976296\tvalid_0's ndcg@5: 0.976402\n",
- "[37]\tvalid_0's ndcg@1: 0.93845\tvalid_0's ndcg@2: 0.974366\tvalid_0's ndcg@3: 0.976166\tvalid_0's ndcg@4: 0.976467\tvalid_0's ndcg@5: 0.976574\n",
- "[38]\tvalid_0's ndcg@1: 0.938925\tvalid_0's ndcg@2: 0.974557\tvalid_0's ndcg@3: 0.976332\tvalid_0's ndcg@4: 0.976655\tvalid_0's ndcg@5: 0.976751\n",
- "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974471\tvalid_0's ndcg@3: 0.976234\tvalid_0's ndcg@4: 0.976557\tvalid_0's ndcg@5: 0.976653\n",
- "[40]\tvalid_0's ndcg@1: 0.938325\tvalid_0's ndcg@2: 0.974335\tvalid_0's ndcg@3: 0.97611\tvalid_0's ndcg@4: 0.976433\tvalid_0's ndcg@5: 0.97653\n",
- "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.974669\tvalid_0's ndcg@3: 0.976431\tvalid_0's ndcg@4: 0.976743\tvalid_0's ndcg@5: 0.97683\n",
- "[42]\tvalid_0's ndcg@1: 0.939375\tvalid_0's ndcg@2: 0.974833\tvalid_0's ndcg@3: 0.976546\tvalid_0's ndcg@4: 0.976858\tvalid_0's ndcg@5: 0.976945\n",
- "[43]\tvalid_0's ndcg@1: 0.939625\tvalid_0's ndcg@2: 0.974878\tvalid_0's ndcg@3: 0.976628\tvalid_0's ndcg@4: 0.97694\tvalid_0's ndcg@5: 0.977027\n",
- "[44]\tvalid_0's ndcg@1: 0.9395\tvalid_0's ndcg@2: 0.974832\tvalid_0's ndcg@3: 0.97657\tvalid_0's ndcg@4: 0.976893\tvalid_0's ndcg@5: 0.97698\n",
- "[45]\tvalid_0's ndcg@1: 0.939775\tvalid_0's ndcg@2: 0.974949\tvalid_0's ndcg@3: 0.976674\tvalid_0's ndcg@4: 0.976997\tvalid_0's ndcg@5: 0.977084\n",
- "[46]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.974945\tvalid_0's ndcg@3: 0.976708\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977107\n",
- "[47]\tvalid_0's ndcg@1: 0.94005\tvalid_0's ndcg@2: 0.975004\tvalid_0's ndcg@3: 0.976766\tvalid_0's ndcg@4: 0.977078\tvalid_0's ndcg@5: 0.977175\n",
- "[48]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
- "[49]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
- "[50]\tvalid_0's ndcg@1: 0.9405\tvalid_0's ndcg@2: 0.975264\tvalid_0's ndcg@3: 0.976989\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
- "[51]\tvalid_0's ndcg@1: 0.941125\tvalid_0's ndcg@2: 0.975526\tvalid_0's ndcg@3: 0.977226\tvalid_0's ndcg@4: 0.977528\tvalid_0's ndcg@5: 0.977605\n",
- "[52]\tvalid_0's ndcg@1: 0.941\tvalid_0's ndcg@2: 0.97548\tvalid_0's ndcg@3: 0.977193\tvalid_0's ndcg@4: 0.977484\tvalid_0's ndcg@5: 0.977561\n",
- "[53]\tvalid_0's ndcg@1: 0.9411\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.977259\tvalid_0's ndcg@4: 0.977539\tvalid_0's ndcg@5: 0.977616\n",
- "[54]\tvalid_0's ndcg@1: 0.9412\tvalid_0's ndcg@2: 0.975712\tvalid_0's ndcg@3: 0.977299\tvalid_0's ndcg@4: 0.97759\tvalid_0's ndcg@5: 0.977667\n",
- "[55]\tvalid_0's ndcg@1: 0.94155\tvalid_0's ndcg@2: 0.975841\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977719\tvalid_0's ndcg@5: 0.977797\n",
- "[56]\tvalid_0's ndcg@1: 0.941825\tvalid_0's ndcg@2: 0.975943\tvalid_0's ndcg@3: 0.97753\tvalid_0's ndcg@4: 0.977821\tvalid_0's ndcg@5: 0.977898\n",
- "[57]\tvalid_0's ndcg@1: 0.9416\tvalid_0's ndcg@2: 0.975891\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977741\tvalid_0's ndcg@5: 0.977818\n",
- "[58]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977494\tvalid_0's ndcg@4: 0.977795\tvalid_0's ndcg@5: 0.977873\n",
- "[59]\tvalid_0's ndcg@1: 0.942025\tvalid_0's ndcg@2: 0.975985\tvalid_0's ndcg@3: 0.977547\tvalid_0's ndcg@4: 0.977881\tvalid_0's ndcg@5: 0.977958\n",
- "[60]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975994\tvalid_0's ndcg@3: 0.977569\tvalid_0's ndcg@4: 0.977892\tvalid_0's ndcg@5: 0.977969\n",
- "[61]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977559\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.97796\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Load ranking features"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[62]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976027\tvalid_0's ndcg@3: 0.97764\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.978028\n",
- "[63]\tvalid_0's ndcg@1: 0.942125\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977622\tvalid_0's ndcg@4: 0.977912\tvalid_0's ndcg@5: 0.977999\n",
- "[64]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977793\tvalid_0's ndcg@4: 0.978105\tvalid_0's ndcg@5: 0.978192\n",
- "[65]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976227\tvalid_0's ndcg@3: 0.977802\tvalid_0's ndcg@4: 0.978125\tvalid_0's ndcg@5: 0.978212\n",
- "[66]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976132\tvalid_0's ndcg@3: 0.977695\tvalid_0's ndcg@4: 0.978018\tvalid_0's ndcg@5: 0.978105\n",
- "[67]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976092\tvalid_0's ndcg@3: 0.977679\tvalid_0's ndcg@4: 0.978002\tvalid_0's ndcg@5: 0.978089\n",
- "[68]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976148\tvalid_0's ndcg@3: 0.977698\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.978108\n",
- "[69]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976123\tvalid_0's ndcg@3: 0.977686\tvalid_0's ndcg@4: 0.978009\tvalid_0's ndcg@5: 0.978096\n",
- "[70]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976222\tvalid_0's ndcg@3: 0.977785\tvalid_0's ndcg@4: 0.978097\tvalid_0's ndcg@5: 0.978184\n",
- "[71]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976188\tvalid_0's ndcg@3: 0.977763\tvalid_0's ndcg@4: 0.978075\tvalid_0's ndcg@5: 0.978162\n",
- "[72]\tvalid_0's ndcg@1: 0.9427\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977809\tvalid_0's ndcg@4: 0.978121\tvalid_0's ndcg@5: 0.978208\n",
- "[73]\tvalid_0's ndcg@1: 0.9428\tvalid_0's ndcg@2: 0.976255\tvalid_0's ndcg@3: 0.977843\tvalid_0's ndcg@4: 0.978155\tvalid_0's ndcg@5: 0.978242\n",
- "[74]\tvalid_0's ndcg@1: 0.94295\tvalid_0's ndcg@2: 0.97631\tvalid_0's ndcg@3: 0.977898\tvalid_0's ndcg@4: 0.97821\tvalid_0's ndcg@5: 0.978297\n",
- "[75]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976329\tvalid_0's ndcg@3: 0.977941\tvalid_0's ndcg@4: 0.978232\tvalid_0's ndcg@5: 0.978319\n",
- "[76]\tvalid_0's ndcg@1: 0.9433\tvalid_0's ndcg@2: 0.976471\tvalid_0's ndcg@3: 0.978059\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978437\n",
- "[77]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976416\tvalid_0's ndcg@3: 0.977991\tvalid_0's ndcg@4: 0.978314\tvalid_0's ndcg@5: 0.978381\n",
- "[78]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976657\tvalid_0's ndcg@3: 0.978194\tvalid_0's ndcg@4: 0.978517\tvalid_0's ndcg@5: 0.978585\n",
- "[79]\tvalid_0's ndcg@1: 0.94365\tvalid_0's ndcg@2: 0.976663\tvalid_0's ndcg@3: 0.978188\tvalid_0's ndcg@4: 0.978501\tvalid_0's ndcg@5: 0.978578\n",
- "[80]\tvalid_0's ndcg@1: 0.943725\tvalid_0's ndcg@2: 0.976628\tvalid_0's ndcg@3: 0.978203\tvalid_0's ndcg@4: 0.978515\tvalid_0's ndcg@5: 0.978593\n",
- "[81]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97672\tvalid_0's ndcg@3: 0.978295\tvalid_0's ndcg@4: 0.978607\tvalid_0's ndcg@5: 0.978685\n",
- "[82]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978397\tvalid_0's ndcg@4: 0.97872\tvalid_0's ndcg@5: 0.978787\n",
- "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976788\tvalid_0's ndcg@3: 0.978375\tvalid_0's ndcg@4: 0.978698\tvalid_0's ndcg@5: 0.978766\n",
- "[84]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.97679\tvalid_0's ndcg@3: 0.97839\tvalid_0's ndcg@4: 0.978702\tvalid_0's ndcg@5: 0.97878\n",
- "[85]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.976809\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978723\tvalid_0's ndcg@5: 0.9788\n",
- "[86]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976939\tvalid_0's ndcg@3: 0.978502\tvalid_0's ndcg@4: 0.978814\tvalid_0's ndcg@5: 0.978891\n",
- "[87]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.976976\tvalid_0's ndcg@3: 0.978551\tvalid_0's ndcg@4: 0.978852\tvalid_0's ndcg@5: 0.97893\n",
- "[88]\tvalid_0's ndcg@1: 0.944925\tvalid_0's ndcg@2: 0.977102\tvalid_0's ndcg@3: 0.978677\tvalid_0's ndcg@4: 0.978968\tvalid_0's ndcg@5: 0.979045\n",
- "[89]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978758\tvalid_0's ndcg@4: 0.979048\tvalid_0's ndcg@5: 0.979126\n",
- "[90]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.978735\tvalid_0's ndcg@4: 0.979026\tvalid_0's ndcg@5: 0.979104\n",
- "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978858\tvalid_0's ndcg@4: 0.979138\tvalid_0's ndcg@5: 0.979215\n",
- "[92]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.977267\tvalid_0's ndcg@3: 0.978905\tvalid_0's ndcg@4: 0.979174\tvalid_0's ndcg@5: 0.979251\n",
- "[93]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977193\tvalid_0's ndcg@3: 0.978818\tvalid_0's ndcg@4: 0.979098\tvalid_0's ndcg@5: 0.979176\n",
- "[94]\tvalid_0's ndcg@1: 0.94545\tvalid_0's ndcg@2: 0.97728\tvalid_0's ndcg@3: 0.97888\tvalid_0's ndcg@4: 0.97916\tvalid_0's ndcg@5: 0.979238\n",
- "[95]\tvalid_0's ndcg@1: 0.9458\tvalid_0's ndcg@2: 0.977394\tvalid_0's ndcg@3: 0.979006\tvalid_0's ndcg@4: 0.979286\tvalid_0's ndcg@5: 0.979364\n",
- "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979114\tvalid_0's ndcg@4: 0.979394\tvalid_0's ndcg@5: 0.979472\n",
- "[97]\tvalid_0's ndcg@1: 0.946475\tvalid_0's ndcg@2: 0.977659\tvalid_0's ndcg@3: 0.979259\tvalid_0's ndcg@4: 0.979539\tvalid_0's ndcg@5: 0.979616\n",
- "[98]\tvalid_0's ndcg@1: 0.94675\tvalid_0's ndcg@2: 0.97776\tvalid_0's ndcg@3: 0.97936\tvalid_0's ndcg@4: 0.979651\tvalid_0's ndcg@5: 0.979719\n",
- "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
- "[100]\tvalid_0's ndcg@1: 0.9468\tvalid_0's ndcg@2: 0.977794\tvalid_0's ndcg@3: 0.979369\tvalid_0's ndcg@4: 0.979671\tvalid_0's ndcg@5: 0.979739\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
- "[1]\tvalid_0's ndcg@1: 0.909075\tvalid_0's ndcg@2: 0.963019\tvalid_0's ndcg@3: 0.965069\tvalid_0's ndcg@4: 0.965543\tvalid_0's ndcg@5: 0.965601\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9123\tvalid_0's ndcg@2: 0.964273\tvalid_0's ndcg@3: 0.966248\tvalid_0's ndcg@4: 0.966722\tvalid_0's ndcg@5: 0.966789\n",
- "[3]\tvalid_0's ndcg@1: 0.915075\tvalid_0's ndcg@2: 0.965691\tvalid_0's ndcg@3: 0.967466\tvalid_0's ndcg@4: 0.967854\tvalid_0's ndcg@5: 0.967922\n",
- "[4]\tvalid_0's ndcg@1: 0.91845\tvalid_0's ndcg@2: 0.967047\tvalid_0's ndcg@3: 0.968735\tvalid_0's ndcg@4: 0.969133\tvalid_0's ndcg@5: 0.969201\n",
- "[5]\tvalid_0's ndcg@1: 0.92355\tvalid_0's ndcg@2: 0.968961\tvalid_0's ndcg@3: 0.970674\tvalid_0's ndcg@4: 0.97104\tvalid_0's ndcg@5: 0.971098\n",
- "[6]\tvalid_0's ndcg@1: 0.9253\tvalid_0's ndcg@2: 0.969607\tvalid_0's ndcg@3: 0.971345\tvalid_0's ndcg@4: 0.971689\tvalid_0's ndcg@5: 0.971747\n",
- "[7]\tvalid_0's ndcg@1: 0.926225\tvalid_0's ndcg@2: 0.969933\tvalid_0's ndcg@3: 0.971708\tvalid_0's ndcg@4: 0.972031\tvalid_0's ndcg@5: 0.972079\n",
- "[8]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.970104\tvalid_0's ndcg@3: 0.971804\tvalid_0's ndcg@4: 0.972116\tvalid_0's ndcg@5: 0.972184\n",
- "[9]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970682\tvalid_0's ndcg@3: 0.972307\tvalid_0's ndcg@4: 0.972598\tvalid_0's ndcg@5: 0.972675\n",
- "[10]\tvalid_0's ndcg@1: 0.92775\tvalid_0's ndcg@2: 0.970653\tvalid_0's ndcg@3: 0.972316\tvalid_0's ndcg@4: 0.972617\tvalid_0's ndcg@5: 0.972685\n",
- "[11]\tvalid_0's ndcg@1: 0.9283\tvalid_0's ndcg@2: 0.97084\tvalid_0's ndcg@3: 0.97254\tvalid_0's ndcg@4: 0.97281\tvalid_0's ndcg@5: 0.972887\n",
- "[12]\tvalid_0's ndcg@1: 0.9287\tvalid_0's ndcg@2: 0.971051\tvalid_0's ndcg@3: 0.972701\tvalid_0's ndcg@4: 0.97297\tvalid_0's ndcg@5: 0.973048\n",
- "[13]\tvalid_0's ndcg@1: 0.9297\tvalid_0's ndcg@2: 0.971389\tvalid_0's ndcg@3: 0.973001\tvalid_0's ndcg@4: 0.973313\tvalid_0's ndcg@5: 0.9734\n",
- "[14]\tvalid_0's ndcg@1: 0.92955\tvalid_0's ndcg@2: 0.971444\tvalid_0's ndcg@3: 0.972994\tvalid_0's ndcg@4: 0.973284\tvalid_0's ndcg@5: 0.973371\n",
- "[15]\tvalid_0's ndcg@1: 0.930225\tvalid_0's ndcg@2: 0.97174\tvalid_0's ndcg@3: 0.973253\tvalid_0's ndcg@4: 0.973543\tvalid_0's ndcg@5: 0.97363\n",
- "[16]\tvalid_0's ndcg@1: 0.930425\tvalid_0's ndcg@2: 0.971798\tvalid_0's ndcg@3: 0.973298\tvalid_0's ndcg@4: 0.97361\tvalid_0's ndcg@5: 0.973698\n",
- "[17]\tvalid_0's ndcg@1: 0.93125\tvalid_0's ndcg@2: 0.971992\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973903\tvalid_0's ndcg@5: 0.97398\n",
- "[18]\tvalid_0's ndcg@1: 0.931925\tvalid_0's ndcg@2: 0.972257\tvalid_0's ndcg@3: 0.973845\tvalid_0's ndcg@4: 0.974146\tvalid_0's ndcg@5: 0.974224\n",
- "[19]\tvalid_0's ndcg@1: 0.932375\tvalid_0's ndcg@2: 0.972376\tvalid_0's ndcg@3: 0.974038\tvalid_0's ndcg@4: 0.974318\tvalid_0's ndcg@5: 0.974376\n",
- "[20]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.972269\tvalid_0's ndcg@3: 0.973907\tvalid_0's ndcg@4: 0.974187\tvalid_0's ndcg@5: 0.974245\n",
- "[21]\tvalid_0's ndcg@1: 0.932725\tvalid_0's ndcg@2: 0.972568\tvalid_0's ndcg@3: 0.974181\tvalid_0's ndcg@4: 0.974471\tvalid_0's ndcg@5: 0.974529\n",
- "[22]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972735\tvalid_0's ndcg@3: 0.974298\tvalid_0's ndcg@4: 0.974599\tvalid_0's ndcg@5: 0.974657\n",
- "[23]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972642\tvalid_0's ndcg@3: 0.974255\tvalid_0's ndcg@4: 0.974545\tvalid_0's ndcg@5: 0.974594\n",
- "[24]\tvalid_0's ndcg@1: 0.933175\tvalid_0's ndcg@2: 0.972734\tvalid_0's ndcg@3: 0.974347\tvalid_0's ndcg@4: 0.974638\tvalid_0's ndcg@5: 0.974686\n",
- "[25]\tvalid_0's ndcg@1: 0.9331\tvalid_0's ndcg@2: 0.972754\tvalid_0's ndcg@3: 0.974366\tvalid_0's ndcg@4: 0.974636\tvalid_0's ndcg@5: 0.974674\n"
- ]
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:41.843180Z",
+ "start_time": "2020-11-18T04:20:41.837287Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'\n",
+ "offline = False"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[26]\tvalid_0's ndcg@1: 0.933275\tvalid_0's ndcg@2: 0.972787\tvalid_0's ndcg@3: 0.974424\tvalid_0's ndcg@4: 0.974694\tvalid_0's ndcg@5: 0.974732\n",
- "[27]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972809\tvalid_0's ndcg@3: 0.974434\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.974732\n",
- "[28]\tvalid_0's ndcg@1: 0.933625\tvalid_0's ndcg@2: 0.972932\tvalid_0's ndcg@3: 0.974557\tvalid_0's ndcg@4: 0.974826\tvalid_0's ndcg@5: 0.974855\n",
- "[29]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972937\tvalid_0's ndcg@3: 0.974587\tvalid_0's ndcg@4: 0.974856\tvalid_0's ndcg@5: 0.974885\n",
- "[30]\tvalid_0's ndcg@1: 0.93355\tvalid_0's ndcg@2: 0.972873\tvalid_0's ndcg@3: 0.974523\tvalid_0's ndcg@4: 0.974792\tvalid_0's ndcg@5: 0.974821\n",
- "[31]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973065\tvalid_0's ndcg@3: 0.974753\tvalid_0's ndcg@4: 0.975022\tvalid_0's ndcg@5: 0.975051\n",
- "[32]\tvalid_0's ndcg@1: 0.93435\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974815\tvalid_0's ndcg@4: 0.975084\tvalid_0's ndcg@5: 0.975113\n",
- "[33]\tvalid_0's ndcg@1: 0.934475\tvalid_0's ndcg@2: 0.97323\tvalid_0's ndcg@3: 0.974855\tvalid_0's ndcg@4: 0.975135\tvalid_0's ndcg@5: 0.975164\n",
- "[34]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973113\tvalid_0's ndcg@3: 0.974738\tvalid_0's ndcg@4: 0.975028\tvalid_0's ndcg@5: 0.975057\n",
- "[35]\tvalid_0's ndcg@1: 0.93455\tvalid_0's ndcg@2: 0.973258\tvalid_0's ndcg@3: 0.97487\tvalid_0's ndcg@4: 0.975172\tvalid_0's ndcg@5: 0.975201\n",
- "[36]\tvalid_0's ndcg@1: 0.9344\tvalid_0's ndcg@2: 0.973265\tvalid_0's ndcg@3: 0.974828\tvalid_0's ndcg@4: 0.975129\tvalid_0's ndcg@5: 0.975158\n",
- "[37]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973438\tvalid_0's ndcg@3: 0.975013\tvalid_0's ndcg@4: 0.975304\tvalid_0's ndcg@5: 0.975323\n",
- "[38]\tvalid_0's ndcg@1: 0.934975\tvalid_0's ndcg@2: 0.973541\tvalid_0's ndcg@3: 0.975066\tvalid_0's ndcg@4: 0.975367\tvalid_0's ndcg@5: 0.975386\n",
- "[39]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973667\tvalid_0's ndcg@3: 0.975192\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975502\n",
- "[40]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973624\tvalid_0's ndcg@3: 0.975174\tvalid_0's ndcg@4: 0.975454\tvalid_0's ndcg@5: 0.975473\n",
- "[41]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973686\tvalid_0's ndcg@3: 0.975223\tvalid_0's ndcg@4: 0.975503\tvalid_0's ndcg@5: 0.975522\n",
- "[42]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973716\tvalid_0's ndcg@3: 0.975266\tvalid_0's ndcg@4: 0.975546\tvalid_0's ndcg@5: 0.975565\n",
- "[43]\tvalid_0's ndcg@1: 0.93615\tvalid_0's ndcg@2: 0.974022\tvalid_0's ndcg@3: 0.975534\tvalid_0's ndcg@4: 0.975814\tvalid_0's ndcg@5: 0.975843\n",
- "[44]\tvalid_0's ndcg@1: 0.936225\tvalid_0's ndcg@2: 0.974112\tvalid_0's ndcg@3: 0.975562\tvalid_0's ndcg@4: 0.975853\tvalid_0's ndcg@5: 0.975882\n",
- "[45]\tvalid_0's ndcg@1: 0.9365\tvalid_0's ndcg@2: 0.974167\tvalid_0's ndcg@3: 0.975654\tvalid_0's ndcg@4: 0.975945\tvalid_0's ndcg@5: 0.975974\n",
- "[46]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974206\tvalid_0's ndcg@3: 0.975694\tvalid_0's ndcg@4: 0.975995\tvalid_0's ndcg@5: 0.976024\n",
- "[47]\tvalid_0's ndcg@1: 0.93685\tvalid_0's ndcg@2: 0.974311\tvalid_0's ndcg@3: 0.975786\tvalid_0's ndcg@4: 0.976077\tvalid_0's ndcg@5: 0.976106\n",
- "[48]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974408\tvalid_0's ndcg@3: 0.975845\tvalid_0's ndcg@4: 0.976147\tvalid_0's ndcg@5: 0.976185\n",
- "[49]\tvalid_0's ndcg@1: 0.936975\tvalid_0's ndcg@2: 0.974342\tvalid_0's ndcg@3: 0.975829\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.976159\n",
- "[50]\tvalid_0's ndcg@1: 0.9371\tvalid_0's ndcg@2: 0.974388\tvalid_0's ndcg@3: 0.97585\tvalid_0's ndcg@4: 0.976152\tvalid_0's ndcg@5: 0.976191\n",
- "[51]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974329\tvalid_0's ndcg@3: 0.975841\tvalid_0's ndcg@4: 0.976121\tvalid_0's ndcg@5: 0.97616\n",
- "[52]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974578\tvalid_0's ndcg@3: 0.976078\tvalid_0's ndcg@4: 0.976369\tvalid_0's ndcg@5: 0.976407\n",
- "[53]\tvalid_0's ndcg@1: 0.9378\tvalid_0's ndcg@2: 0.974615\tvalid_0's ndcg@3: 0.976115\tvalid_0's ndcg@4: 0.976405\tvalid_0's ndcg@5: 0.976444\n",
- "[54]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974689\tvalid_0's ndcg@3: 0.976214\tvalid_0's ndcg@4: 0.976483\tvalid_0's ndcg@5: 0.976521\n",
- "[55]\tvalid_0's ndcg@1: 0.938225\tvalid_0's ndcg@2: 0.974803\tvalid_0's ndcg@3: 0.976303\tvalid_0's ndcg@4: 0.976572\tvalid_0's ndcg@5: 0.976611\n",
- "[56]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.9748\tvalid_0's ndcg@3: 0.976275\tvalid_0's ndcg@4: 0.976555\tvalid_0's ndcg@5: 0.976594\n",
- "[57]\tvalid_0's ndcg@1: 0.938525\tvalid_0's ndcg@2: 0.974914\tvalid_0's ndcg@3: 0.976414\tvalid_0's ndcg@4: 0.976683\tvalid_0's ndcg@5: 0.976722\n",
- "[58]\tvalid_0's ndcg@1: 0.93875\tvalid_0's ndcg@2: 0.975028\tvalid_0's ndcg@3: 0.976503\tvalid_0's ndcg@4: 0.976773\tvalid_0's ndcg@5: 0.976811\n",
- "[59]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975198\tvalid_0's ndcg@3: 0.976648\tvalid_0's ndcg@4: 0.976918\tvalid_0's ndcg@5: 0.976956\n",
- "[60]\tvalid_0's ndcg@1: 0.939025\tvalid_0's ndcg@2: 0.975177\tvalid_0's ndcg@3: 0.976615\tvalid_0's ndcg@4: 0.976884\tvalid_0's ndcg@5: 0.976923\n",
- "[61]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975205\tvalid_0's ndcg@3: 0.976642\tvalid_0's ndcg@4: 0.976912\tvalid_0's ndcg@5: 0.97695\n",
- "[62]\tvalid_0's ndcg@1: 0.93965\tvalid_0's ndcg@2: 0.975424\tvalid_0's ndcg@3: 0.976836\tvalid_0's ndcg@4: 0.977116\tvalid_0's ndcg@5: 0.977155\n",
- "[63]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.976996\tvalid_0's ndcg@4: 0.977276\tvalid_0's ndcg@5: 0.977315\n",
- "[64]\tvalid_0's ndcg@1: 0.940375\tvalid_0's ndcg@2: 0.975723\tvalid_0's ndcg@3: 0.977123\tvalid_0's ndcg@4: 0.977392\tvalid_0's ndcg@5: 0.977431\n",
- "[65]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977154\tvalid_0's ndcg@4: 0.977423\tvalid_0's ndcg@5: 0.977462\n",
- "[66]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975744\tvalid_0's ndcg@3: 0.977156\tvalid_0's ndcg@4: 0.977426\tvalid_0's ndcg@5: 0.977464\n",
- "[67]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.97576\tvalid_0's ndcg@3: 0.977172\tvalid_0's ndcg@4: 0.977431\tvalid_0's ndcg@5: 0.977469\n",
- "[68]\tvalid_0's ndcg@1: 0.940675\tvalid_0's ndcg@2: 0.975849\tvalid_0's ndcg@3: 0.977249\tvalid_0's ndcg@4: 0.977508\tvalid_0's ndcg@5: 0.977546\n",
- "[69]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.976017\tvalid_0's ndcg@3: 0.977454\tvalid_0's ndcg@4: 0.977724\tvalid_0's ndcg@5: 0.977762\n",
- "[70]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.977362\tvalid_0's ndcg@4: 0.977631\tvalid_0's ndcg@5: 0.97767\n",
- "[71]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.97735\tvalid_0's ndcg@4: 0.97763\tvalid_0's ndcg@5: 0.977668\n",
- "[72]\tvalid_0's ndcg@1: 0.941325\tvalid_0's ndcg@2: 0.976058\tvalid_0's ndcg@3: 0.97747\tvalid_0's ndcg@4: 0.977739\tvalid_0's ndcg@5: 0.977778\n",
- "[73]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977476\tvalid_0's ndcg@4: 0.977756\tvalid_0's ndcg@5: 0.977795\n",
- "[74]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.97759\tvalid_0's ndcg@4: 0.97788\tvalid_0's ndcg@5: 0.977919\n",
- "[75]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.977602\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.977921\n",
- "[76]\tvalid_0's ndcg@1: 0.94195\tvalid_0's ndcg@2: 0.976273\tvalid_0's ndcg@3: 0.977685\tvalid_0's ndcg@4: 0.977965\tvalid_0's ndcg@5: 0.978004\n",
- "[77]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.97627\tvalid_0's ndcg@3: 0.97767\tvalid_0's ndcg@4: 0.97795\tvalid_0's ndcg@5: 0.977989\n",
- "[78]\tvalid_0's ndcg@1: 0.94235\tvalid_0's ndcg@2: 0.976452\tvalid_0's ndcg@3: 0.977839\tvalid_0's ndcg@4: 0.978119\tvalid_0's ndcg@5: 0.978158\n",
- "[79]\tvalid_0's ndcg@1: 0.94265\tvalid_0's ndcg@2: 0.976562\tvalid_0's ndcg@3: 0.977937\tvalid_0's ndcg@4: 0.978228\tvalid_0's ndcg@5: 0.978267\n",
- "[80]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976667\tvalid_0's ndcg@3: 0.978067\tvalid_0's ndcg@4: 0.978347\tvalid_0's ndcg@5: 0.978385\n",
- "[81]\tvalid_0's ndcg@1: 0.94305\tvalid_0's ndcg@2: 0.97671\tvalid_0's ndcg@3: 0.978098\tvalid_0's ndcg@4: 0.978378\tvalid_0's ndcg@5: 0.978416\n",
- "[82]\tvalid_0's ndcg@1: 0.943175\tvalid_0's ndcg@2: 0.97674\tvalid_0's ndcg@3: 0.978115\tvalid_0's ndcg@4: 0.978417\tvalid_0's ndcg@5: 0.978456\n",
- "[83]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976752\tvalid_0's ndcg@3: 0.97814\tvalid_0's ndcg@4: 0.978441\tvalid_0's ndcg@5: 0.97848\n",
- "[84]\tvalid_0's ndcg@1: 0.943375\tvalid_0's ndcg@2: 0.976767\tvalid_0's ndcg@3: 0.978179\tvalid_0's ndcg@4: 0.978481\tvalid_0's ndcg@5: 0.97852\n",
- "[85]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976721\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978437\tvalid_0's ndcg@5: 0.978475\n",
- "[86]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976792\tvalid_0's ndcg@3: 0.978204\tvalid_0's ndcg@4: 0.978506\tvalid_0's ndcg@5: 0.978535\n",
- "[87]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.976851\tvalid_0's ndcg@3: 0.978239\tvalid_0's ndcg@4: 0.97854\tvalid_0's ndcg@5: 0.978569\n",
- "[88]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976882\tvalid_0's ndcg@3: 0.978282\tvalid_0's ndcg@4: 0.978572\tvalid_0's ndcg@5: 0.978611\n",
- "[89]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.976915\tvalid_0's ndcg@3: 0.97834\tvalid_0's ndcg@4: 0.97863\tvalid_0's ndcg@5: 0.978669\n"
- ]
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:53.358138Z",
+ "start_time": "2020-11-18T04:20:44.232944Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# When the data is reloaded from CSV, click_article_id comes back as a float, so cast it to int\n",
+ "trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
+ "trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
+ " val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
+ "tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)\n",
+ "\n",
+    "# For convenience during feature engineering, the test set was also given a dummy label; drop it here\n",
+ "del tst_user_item_feats_df['label']"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[90]\tvalid_0's ndcg@1: 0.943925\tvalid_0's ndcg@2: 0.976986\tvalid_0's ndcg@3: 0.978398\tvalid_0's ndcg@4: 0.978689\tvalid_0's ndcg@5: 0.978728\n",
- "[91]\tvalid_0's ndcg@1: 0.943875\tvalid_0's ndcg@2: 0.976999\tvalid_0's ndcg@3: 0.978399\tvalid_0's ndcg@4: 0.978679\tvalid_0's ndcg@5: 0.978717\n",
- "[92]\tvalid_0's ndcg@1: 0.94395\tvalid_0's ndcg@2: 0.977058\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978711\tvalid_0's ndcg@5: 0.97876\n",
- "[93]\tvalid_0's ndcg@1: 0.944075\tvalid_0's ndcg@2: 0.977104\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978759\tvalid_0's ndcg@5: 0.978807\n",
- "[94]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977125\tvalid_0's ndcg@3: 0.978513\tvalid_0's ndcg@4: 0.978793\tvalid_0's ndcg@5: 0.978841\n",
- "[95]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977153\tvalid_0's ndcg@3: 0.97854\tvalid_0's ndcg@4: 0.97882\tvalid_0's ndcg@5: 0.978869\n",
- "[96]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977144\tvalid_0's ndcg@3: 0.978531\tvalid_0's ndcg@4: 0.978811\tvalid_0's ndcg@5: 0.97886\n",
- "[97]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977221\tvalid_0's ndcg@3: 0.978584\tvalid_0's ndcg@4: 0.978864\tvalid_0's ndcg@5: 0.978912\n",
- "[98]\tvalid_0's ndcg@1: 0.944575\tvalid_0's ndcg@2: 0.977289\tvalid_0's ndcg@3: 0.978651\tvalid_0's ndcg@4: 0.978942\tvalid_0's ndcg@5: 0.97899\n",
- "[99]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977341\tvalid_0's ndcg@3: 0.978691\tvalid_0's ndcg@4: 0.978993\tvalid_0's ndcg@5: 0.979032\n",
- "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
- "[1]\tvalid_0's ndcg@1: 0.911575\tvalid_0's ndcg@2: 0.964384\tvalid_0's ndcg@3: 0.966321\tvalid_0's ndcg@4: 0.966623\tvalid_0's ndcg@5: 0.966671\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9136\tvalid_0's ndcg@2: 0.965257\tvalid_0's ndcg@3: 0.967107\tvalid_0's ndcg@4: 0.967398\tvalid_0's ndcg@5: 0.967456\n",
- "[3]\tvalid_0's ndcg@1: 0.917425\tvalid_0's ndcg@2: 0.966732\tvalid_0's ndcg@3: 0.968545\tvalid_0's ndcg@4: 0.968814\tvalid_0's ndcg@5: 0.968882\n",
- "[4]\tvalid_0's ndcg@1: 0.9222\tvalid_0's ndcg@2: 0.968558\tvalid_0's ndcg@3: 0.970383\tvalid_0's ndcg@4: 0.970619\tvalid_0's ndcg@5: 0.970668\n",
- "[5]\tvalid_0's ndcg@1: 0.925875\tvalid_0's ndcg@2: 0.969914\tvalid_0's ndcg@3: 0.971714\tvalid_0's ndcg@4: 0.971972\tvalid_0's ndcg@5: 0.972021\n",
- "[6]\tvalid_0's ndcg@1: 0.926875\tvalid_0's ndcg@2: 0.970425\tvalid_0's ndcg@3: 0.972112\tvalid_0's ndcg@4: 0.972371\tvalid_0's ndcg@5: 0.972419\n",
- "[7]\tvalid_0's ndcg@1: 0.927475\tvalid_0's ndcg@2: 0.970631\tvalid_0's ndcg@3: 0.972306\tvalid_0's ndcg@4: 0.972586\tvalid_0's ndcg@5: 0.972634\n",
- "[8]\tvalid_0's ndcg@1: 0.93015\tvalid_0's ndcg@2: 0.971649\tvalid_0's ndcg@3: 0.973287\tvalid_0's ndcg@4: 0.973567\tvalid_0's ndcg@5: 0.973625\n",
- "[9]\tvalid_0's ndcg@1: 0.9312\tvalid_0's ndcg@2: 0.972084\tvalid_0's ndcg@3: 0.973684\tvalid_0's ndcg@4: 0.973964\tvalid_0's ndcg@5: 0.974022\n",
- "[10]\tvalid_0's ndcg@1: 0.93225\tvalid_0's ndcg@2: 0.972456\tvalid_0's ndcg@3: 0.974081\tvalid_0's ndcg@4: 0.974361\tvalid_0's ndcg@5: 0.974409\n",
- "[11]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972704\tvalid_0's ndcg@3: 0.974379\tvalid_0's ndcg@4: 0.974648\tvalid_0's ndcg@5: 0.974696\n",
- "[12]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974574\tvalid_0's ndcg@4: 0.974832\tvalid_0's ndcg@5: 0.974881\n",
- "[13]\tvalid_0's ndcg@1: 0.93415\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.97482\tvalid_0's ndcg@4: 0.975079\tvalid_0's ndcg@5: 0.975127\n",
- "[14]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973671\tvalid_0's ndcg@3: 0.975246\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975531\n",
- "[15]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.973877\tvalid_0's ndcg@3: 0.975452\tvalid_0's ndcg@4: 0.975699\tvalid_0's ndcg@5: 0.975748\n",
- "[16]\tvalid_0's ndcg@1: 0.935825\tvalid_0's ndcg@2: 0.973917\tvalid_0's ndcg@3: 0.975442\tvalid_0's ndcg@4: 0.975712\tvalid_0's ndcg@5: 0.97576\n",
- "[17]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.97411\tvalid_0's ndcg@3: 0.975697\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975995\n",
- "[18]\tvalid_0's ndcg@1: 0.936925\tvalid_0's ndcg@2: 0.974292\tvalid_0's ndcg@3: 0.975867\tvalid_0's ndcg@4: 0.976114\tvalid_0's ndcg@5: 0.976163\n",
- "[19]\tvalid_0's ndcg@1: 0.937525\tvalid_0's ndcg@2: 0.974545\tvalid_0's ndcg@3: 0.976095\tvalid_0's ndcg@4: 0.976342\tvalid_0's ndcg@5: 0.976391\n",
- "[20]\tvalid_0's ndcg@1: 0.937775\tvalid_0's ndcg@2: 0.974653\tvalid_0's ndcg@3: 0.976203\tvalid_0's ndcg@4: 0.976429\tvalid_0's ndcg@5: 0.976487\n",
- "[21]\tvalid_0's ndcg@1: 0.938825\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976597\tvalid_0's ndcg@4: 0.976823\tvalid_0's ndcg@5: 0.976881\n",
- "[22]\tvalid_0's ndcg@1: 0.93885\tvalid_0's ndcg@2: 0.975097\tvalid_0's ndcg@3: 0.976609\tvalid_0's ndcg@4: 0.976846\tvalid_0's ndcg@5: 0.976895\n",
- "[23]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976733\tvalid_0's ndcg@4: 0.976959\tvalid_0's ndcg@5: 0.977008\n",
- "[24]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976721\tvalid_0's ndcg@4: 0.976947\tvalid_0's ndcg@5: 0.977005\n",
- "[25]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975421\tvalid_0's ndcg@3: 0.976909\tvalid_0's ndcg@4: 0.977124\tvalid_0's ndcg@5: 0.977182\n",
- "[26]\tvalid_0's ndcg@1: 0.9393\tvalid_0's ndcg@2: 0.975342\tvalid_0's ndcg@3: 0.976804\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977078\n",
- "[27]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975323\tvalid_0's ndcg@3: 0.976798\tvalid_0's ndcg@4: 0.977014\tvalid_0's ndcg@5: 0.977062\n",
- "[28]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975308\tvalid_0's ndcg@3: 0.976783\tvalid_0's ndcg@4: 0.977009\tvalid_0's ndcg@5: 0.977057\n",
- "[29]\tvalid_0's ndcg@1: 0.94\tvalid_0's ndcg@2: 0.975569\tvalid_0's ndcg@3: 0.977056\tvalid_0's ndcg@4: 0.977282\tvalid_0's ndcg@5: 0.977331\n",
- "[30]\tvalid_0's ndcg@1: 0.940325\tvalid_0's ndcg@2: 0.975673\tvalid_0's ndcg@3: 0.977173\tvalid_0's ndcg@4: 0.977399\tvalid_0's ndcg@5: 0.977447\n",
- "[31]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975731\tvalid_0's ndcg@3: 0.977243\tvalid_0's ndcg@4: 0.977469\tvalid_0's ndcg@5: 0.977518\n",
- "[32]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
- "[33]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977241\tvalid_0's ndcg@4: 0.977457\tvalid_0's ndcg@5: 0.977505\n",
- "[34]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
- "[35]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975868\tvalid_0's ndcg@3: 0.977343\tvalid_0's ndcg@4: 0.977558\tvalid_0's ndcg@5: 0.977606\n",
- "[36]\tvalid_0's ndcg@1: 0.94115\tvalid_0's ndcg@2: 0.976056\tvalid_0's ndcg@3: 0.977506\tvalid_0's ndcg@4: 0.977722\tvalid_0's ndcg@5: 0.97777\n",
- "[37]\tvalid_0's ndcg@1: 0.9414\tvalid_0's ndcg@2: 0.976133\tvalid_0's ndcg@3: 0.977595\tvalid_0's ndcg@4: 0.977811\tvalid_0's ndcg@5: 0.977859\n",
- "[38]\tvalid_0's ndcg@1: 0.94175\tvalid_0's ndcg@2: 0.976278\tvalid_0's ndcg@3: 0.977715\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.97799\n",
- "[39]\tvalid_0's ndcg@1: 0.942075\tvalid_0's ndcg@2: 0.976366\tvalid_0's ndcg@3: 0.977841\tvalid_0's ndcg@4: 0.978056\tvalid_0's ndcg@5: 0.978105\n",
- "[40]\tvalid_0's ndcg@1: 0.94215\tvalid_0's ndcg@2: 0.976409\tvalid_0's ndcg@3: 0.977872\tvalid_0's ndcg@4: 0.978087\tvalid_0's ndcg@5: 0.978136\n",
- "[41]\tvalid_0's ndcg@1: 0.94245\tvalid_0's ndcg@2: 0.97652\tvalid_0's ndcg@3: 0.977983\tvalid_0's ndcg@4: 0.978198\tvalid_0's ndcg@5: 0.978246\n",
- "[42]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
- "[43]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
- "[44]\tvalid_0's ndcg@1: 0.94285\tvalid_0's ndcg@2: 0.976636\tvalid_0's ndcg@3: 0.978111\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978386\n",
- "[45]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.9768\tvalid_0's ndcg@3: 0.978262\tvalid_0's ndcg@4: 0.978488\tvalid_0's ndcg@5: 0.978537\n",
- "[46]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
- "[47]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97836\tvalid_0's ndcg@4: 0.978576\tvalid_0's ndcg@5: 0.978634\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Return the ranked results"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[48]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.978373\tvalid_0's ndcg@4: 0.978577\tvalid_0's ndcg@5: 0.978636\n",
- "[49]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
- "[50]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97702\tvalid_0's ndcg@3: 0.97852\tvalid_0's ndcg@4: 0.978746\tvalid_0's ndcg@5: 0.978794\n",
- "[51]\tvalid_0's ndcg@1: 0.9441\tvalid_0's ndcg@2: 0.97705\tvalid_0's ndcg@3: 0.97855\tvalid_0's ndcg@4: 0.978787\tvalid_0's ndcg@5: 0.978836\n",
- "[52]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977121\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978846\tvalid_0's ndcg@5: 0.978894\n",
- "[53]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977081\tvalid_0's ndcg@3: 0.978618\tvalid_0's ndcg@4: 0.978834\tvalid_0's ndcg@5: 0.978882\n",
- "[54]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977071\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978824\tvalid_0's ndcg@5: 0.978873\n",
- "[55]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977143\tvalid_0's ndcg@3: 0.978668\tvalid_0's ndcg@4: 0.978883\tvalid_0's ndcg@5: 0.978931\n",
- "[56]\tvalid_0's ndcg@1: 0.9444\tvalid_0's ndcg@2: 0.977177\tvalid_0's ndcg@3: 0.978702\tvalid_0's ndcg@4: 0.978906\tvalid_0's ndcg@5: 0.978955\n",
- "[57]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977263\tvalid_0's ndcg@3: 0.978788\tvalid_0's ndcg@4: 0.979003\tvalid_0's ndcg@5: 0.979051\n",
- "[58]\tvalid_0's ndcg@1: 0.9448\tvalid_0's ndcg@2: 0.977293\tvalid_0's ndcg@3: 0.978843\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979096\n",
- "[59]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977472\tvalid_0's ndcg@3: 0.978997\tvalid_0's ndcg@4: 0.979202\tvalid_0's ndcg@5: 0.97925\n",
- "[60]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.97763\tvalid_0's ndcg@3: 0.979118\tvalid_0's ndcg@4: 0.979322\tvalid_0's ndcg@5: 0.979371\n",
- "[61]\tvalid_0's ndcg@1: 0.945725\tvalid_0's ndcg@2: 0.977682\tvalid_0's ndcg@3: 0.979194\tvalid_0's ndcg@4: 0.979399\tvalid_0's ndcg@5: 0.979447\n",
- "[62]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977812\tvalid_0's ndcg@3: 0.979312\tvalid_0's ndcg@4: 0.979495\tvalid_0's ndcg@5: 0.979543\n",
- "[63]\tvalid_0's ndcg@1: 0.946\tvalid_0's ndcg@2: 0.977878\tvalid_0's ndcg@3: 0.97934\tvalid_0's ndcg@4: 0.979523\tvalid_0's ndcg@5: 0.979572\n",
- "[64]\tvalid_0's ndcg@1: 0.946525\tvalid_0's ndcg@2: 0.978056\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979714\tvalid_0's ndcg@5: 0.979762\n",
- "[65]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.978105\tvalid_0's ndcg@3: 0.979592\tvalid_0's ndcg@4: 0.979775\tvalid_0's ndcg@5: 0.979823\n",
- "[66]\tvalid_0's ndcg@1: 0.9465\tvalid_0's ndcg@2: 0.978046\tvalid_0's ndcg@3: 0.979534\tvalid_0's ndcg@4: 0.979706\tvalid_0's ndcg@5: 0.979755\n",
- "[67]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.978127\tvalid_0's ndcg@3: 0.979614\tvalid_0's ndcg@4: 0.979776\tvalid_0's ndcg@5: 0.979824\n",
- "[68]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.979608\tvalid_0's ndcg@4: 0.97978\tvalid_0's ndcg@5: 0.979828\n",
- "[69]\tvalid_0's ndcg@1: 0.946875\tvalid_0's ndcg@2: 0.978216\tvalid_0's ndcg@3: 0.979679\tvalid_0's ndcg@4: 0.979851\tvalid_0's ndcg@5: 0.9799\n",
- "[70]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.978194\tvalid_0's ndcg@3: 0.979682\tvalid_0's ndcg@4: 0.979854\tvalid_0's ndcg@5: 0.979902\n",
- "[71]\tvalid_0's ndcg@1: 0.947025\tvalid_0's ndcg@2: 0.978209\tvalid_0's ndcg@3: 0.979721\tvalid_0's ndcg@4: 0.979893\tvalid_0's ndcg@5: 0.979942\n",
- "[72]\tvalid_0's ndcg@1: 0.9472\tvalid_0's ndcg@2: 0.978273\tvalid_0's ndcg@3: 0.979773\tvalid_0's ndcg@4: 0.979956\tvalid_0's ndcg@5: 0.980005\n",
- "[73]\tvalid_0's ndcg@1: 0.947475\tvalid_0's ndcg@2: 0.978391\tvalid_0's ndcg@3: 0.979878\tvalid_0's ndcg@4: 0.980061\tvalid_0's ndcg@5: 0.980109\n",
- "[74]\tvalid_0's ndcg@1: 0.94715\tvalid_0's ndcg@2: 0.978271\tvalid_0's ndcg@3: 0.979758\tvalid_0's ndcg@4: 0.979941\tvalid_0's ndcg@5: 0.97999\n",
- "[75]\tvalid_0's ndcg@1: 0.947275\tvalid_0's ndcg@2: 0.978333\tvalid_0's ndcg@3: 0.979808\tvalid_0's ndcg@4: 0.979991\tvalid_0's ndcg@5: 0.980039\n",
- "[76]\tvalid_0's ndcg@1: 0.9474\tvalid_0's ndcg@2: 0.97841\tvalid_0's ndcg@3: 0.979873\tvalid_0's ndcg@4: 0.980045\tvalid_0's ndcg@5: 0.980093\n",
- "[77]\tvalid_0's ndcg@1: 0.94745\tvalid_0's ndcg@2: 0.97846\tvalid_0's ndcg@3: 0.979898\tvalid_0's ndcg@4: 0.98007\tvalid_0's ndcg@5: 0.980118\n",
- "[78]\tvalid_0's ndcg@1: 0.94775\tvalid_0's ndcg@2: 0.978555\tvalid_0's ndcg@3: 0.980005\tvalid_0's ndcg@4: 0.980177\tvalid_0's ndcg@5: 0.980226\n",
- "[79]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
- "[80]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
- "[81]\tvalid_0's ndcg@1: 0.948175\tvalid_0's ndcg@2: 0.978744\tvalid_0's ndcg@3: 0.980169\tvalid_0's ndcg@4: 0.980352\tvalid_0's ndcg@5: 0.98039\n",
- "[82]\tvalid_0's ndcg@1: 0.948375\tvalid_0's ndcg@2: 0.97888\tvalid_0's ndcg@3: 0.980255\tvalid_0's ndcg@4: 0.980438\tvalid_0's ndcg@5: 0.980477\n",
- "[83]\tvalid_0's ndcg@1: 0.94825\tvalid_0's ndcg@2: 0.978834\tvalid_0's ndcg@3: 0.980209\tvalid_0's ndcg@4: 0.980392\tvalid_0's ndcg@5: 0.980431\n",
- "[84]\tvalid_0's ndcg@1: 0.948275\tvalid_0's ndcg@2: 0.978844\tvalid_0's ndcg@3: 0.980219\tvalid_0's ndcg@4: 0.980402\tvalid_0's ndcg@5: 0.98044\n",
- "[85]\tvalid_0's ndcg@1: 0.948475\tvalid_0's ndcg@2: 0.978917\tvalid_0's ndcg@3: 0.980292\tvalid_0's ndcg@4: 0.980475\tvalid_0's ndcg@5: 0.980514\n",
- "[86]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979102\tvalid_0's ndcg@3: 0.980477\tvalid_0's ndcg@4: 0.98066\tvalid_0's ndcg@5: 0.980699\n",
- "[87]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979086\tvalid_0's ndcg@3: 0.980474\tvalid_0's ndcg@4: 0.980657\tvalid_0's ndcg@5: 0.980695\n",
- "[88]\tvalid_0's ndcg@1: 0.949025\tvalid_0's ndcg@2: 0.979136\tvalid_0's ndcg@3: 0.980499\tvalid_0's ndcg@4: 0.980682\tvalid_0's ndcg@5: 0.98072\n",
- "[89]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979285\tvalid_0's ndcg@3: 0.98061\tvalid_0's ndcg@4: 0.980793\tvalid_0's ndcg@5: 0.980832\n",
- "[90]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
- "[91]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
- "[92]\tvalid_0's ndcg@1: 0.9494\tvalid_0's ndcg@2: 0.97929\tvalid_0's ndcg@3: 0.98064\tvalid_0's ndcg@4: 0.980823\tvalid_0's ndcg@5: 0.980862\n",
- "[93]\tvalid_0's ndcg@1: 0.949375\tvalid_0's ndcg@2: 0.979297\tvalid_0's ndcg@3: 0.980634\tvalid_0's ndcg@4: 0.980817\tvalid_0's ndcg@5: 0.980856\n",
- "[94]\tvalid_0's ndcg@1: 0.949525\tvalid_0's ndcg@2: 0.979336\tvalid_0's ndcg@3: 0.980686\tvalid_0's ndcg@4: 0.980869\tvalid_0's ndcg@5: 0.980908\n",
- "[95]\tvalid_0's ndcg@1: 0.949825\tvalid_0's ndcg@2: 0.979416\tvalid_0's ndcg@3: 0.980791\tvalid_0's ndcg@4: 0.980974\tvalid_0's ndcg@5: 0.981012\n",
- "[96]\tvalid_0's ndcg@1: 0.94975\tvalid_0's ndcg@2: 0.979404\tvalid_0's ndcg@3: 0.980779\tvalid_0's ndcg@4: 0.980951\tvalid_0's ndcg@5: 0.98099\n",
- "[97]\tvalid_0's ndcg@1: 0.950025\tvalid_0's ndcg@2: 0.979537\tvalid_0's ndcg@3: 0.980874\tvalid_0's ndcg@4: 0.981057\tvalid_0's ndcg@5: 0.981096\n",
- "[98]\tvalid_0's ndcg@1: 0.9501\tvalid_0's ndcg@2: 0.979564\tvalid_0's ndcg@3: 0.980889\tvalid_0's ndcg@4: 0.981083\tvalid_0's ndcg@5: 0.981122\n",
- "[99]\tvalid_0's ndcg@1: 0.950275\tvalid_0's ndcg@2: 0.979629\tvalid_0's ndcg@3: 0.980967\tvalid_0's ndcg@4: 0.98115\tvalid_0's ndcg@5: 0.981188\n",
- "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
- "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.96382\tvalid_0's ndcg@3: 0.965707\tvalid_0's ndcg@4: 0.966009\tvalid_0's ndcg@5: 0.966086\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.91415\tvalid_0's ndcg@2: 0.965492\tvalid_0's ndcg@3: 0.967254\tvalid_0's ndcg@4: 0.967556\tvalid_0's ndcg@5: 0.967604\n",
- "[3]\tvalid_0's ndcg@1: 0.916025\tvalid_0's ndcg@2: 0.966389\tvalid_0's ndcg@3: 0.967976\tvalid_0's ndcg@4: 0.968278\tvalid_0's ndcg@5: 0.968355\n",
- "[4]\tvalid_0's ndcg@1: 0.919\tvalid_0's ndcg@2: 0.967392\tvalid_0's ndcg@3: 0.96903\tvalid_0's ndcg@4: 0.969364\tvalid_0's ndcg@5: 0.969431\n",
- "[5]\tvalid_0's ndcg@1: 0.921125\tvalid_0's ndcg@2: 0.968192\tvalid_0's ndcg@3: 0.969855\tvalid_0's ndcg@4: 0.970156\tvalid_0's ndcg@5: 0.970224\n",
- "[6]\tvalid_0's ndcg@1: 0.921675\tvalid_0's ndcg@2: 0.968411\tvalid_0's ndcg@3: 0.970111\tvalid_0's ndcg@4: 0.97037\tvalid_0's ndcg@5: 0.970437\n",
- "[7]\tvalid_0's ndcg@1: 0.9237\tvalid_0's ndcg@2: 0.969332\tvalid_0's ndcg@3: 0.970882\tvalid_0's ndcg@4: 0.97113\tvalid_0's ndcg@5: 0.971217\n",
- "[8]\tvalid_0's ndcg@1: 0.925775\tvalid_0's ndcg@2: 0.970129\tvalid_0's ndcg@3: 0.971642\tvalid_0's ndcg@4: 0.971922\tvalid_0's ndcg@5: 0.97199\n",
- "[9]\tvalid_0's ndcg@1: 0.926775\tvalid_0's ndcg@2: 0.970435\tvalid_0's ndcg@3: 0.971985\tvalid_0's ndcg@4: 0.972276\tvalid_0's ndcg@5: 0.972334\n"
- ]
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:01.809368Z",
+ "start_time": "2020-11-18T04:21:01.799641Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def submit(recall_df, topk=5, model_name=None):\n",
+ "    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
+ "    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "    \n",
+ "    # Check that every user has at least topk (default 5) articles\n",
+ "    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
+ "    assert tmp.min() >= topk\n",
+ "    \n",
+ "    del recall_df['pred_score']\n",
+ "    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
+ "    \n",
+ "    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
+ "    # Name the columns according to the submission format\n",
+ "    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',\n",
+ "                                    3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
+ "    \n",
+ "    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
+ "    submit.to_csv(save_name, index=False, header=True)"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[10]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970761\tvalid_0's ndcg@3: 0.972311\tvalid_0's ndcg@4: 0.972612\tvalid_0's ndcg@5: 0.97267\n",
- "[11]\tvalid_0's ndcg@1: 0.928975\tvalid_0's ndcg@2: 0.97131\tvalid_0's ndcg@3: 0.972798\tvalid_0's ndcg@4: 0.973089\tvalid_0's ndcg@5: 0.973166\n",
- "[12]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971505\tvalid_0's ndcg@3: 0.972968\tvalid_0's ndcg@4: 0.973259\tvalid_0's ndcg@5: 0.973326\n",
- "[13]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971426\tvalid_0's ndcg@3: 0.972939\tvalid_0's ndcg@4: 0.97324\tvalid_0's ndcg@5: 0.973318\n",
- "[14]\tvalid_0's ndcg@1: 0.929775\tvalid_0's ndcg@2: 0.971621\tvalid_0's ndcg@3: 0.973121\tvalid_0's ndcg@4: 0.973412\tvalid_0's ndcg@5: 0.97348\n",
- "[15]\tvalid_0's ndcg@1: 0.9304\tvalid_0's ndcg@2: 0.971868\tvalid_0's ndcg@3: 0.97338\tvalid_0's ndcg@4: 0.97365\tvalid_0's ndcg@5: 0.973717\n",
- "[16]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.972096\tvalid_0's ndcg@3: 0.973558\tvalid_0's ndcg@4: 0.973849\tvalid_0's ndcg@5: 0.973926\n",
- "[17]\tvalid_0's ndcg@1: 0.93105\tvalid_0's ndcg@2: 0.972108\tvalid_0's ndcg@3: 0.973583\tvalid_0's ndcg@4: 0.973884\tvalid_0's ndcg@5: 0.973952\n",
- "[18]\tvalid_0's ndcg@1: 0.931725\tvalid_0's ndcg@2: 0.972373\tvalid_0's ndcg@3: 0.97386\tvalid_0's ndcg@4: 0.974129\tvalid_0's ndcg@5: 0.974207\n",
- "[19]\tvalid_0's ndcg@1: 0.932175\tvalid_0's ndcg@2: 0.972681\tvalid_0's ndcg@3: 0.974068\tvalid_0's ndcg@4: 0.974348\tvalid_0's ndcg@5: 0.974406\n",
- "[20]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.973019\tvalid_0's ndcg@3: 0.974382\tvalid_0's ndcg@4: 0.974673\tvalid_0's ndcg@5: 0.974731\n",
- "[21]\tvalid_0's ndcg@1: 0.933075\tvalid_0's ndcg@2: 0.97306\tvalid_0's ndcg@3: 0.974423\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.97477\n",
- "[22]\tvalid_0's ndcg@1: 0.93375\tvalid_0's ndcg@2: 0.973262\tvalid_0's ndcg@3: 0.974649\tvalid_0's ndcg@4: 0.974929\tvalid_0's ndcg@5: 0.975007\n",
- "[23]\tvalid_0's ndcg@1: 0.933675\tvalid_0's ndcg@2: 0.973219\tvalid_0's ndcg@3: 0.974606\tvalid_0's ndcg@4: 0.974886\tvalid_0's ndcg@5: 0.974973\n",
- "[24]\tvalid_0's ndcg@1: 0.934\tvalid_0's ndcg@2: 0.97337\tvalid_0's ndcg@3: 0.974745\tvalid_0's ndcg@4: 0.975014\tvalid_0's ndcg@5: 0.975101\n",
- "[25]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973674\tvalid_0's ndcg@3: 0.975062\tvalid_0's ndcg@4: 0.975342\tvalid_0's ndcg@5: 0.97541\n",
- "[26]\tvalid_0's ndcg@1: 0.93495\tvalid_0's ndcg@2: 0.973721\tvalid_0's ndcg@3: 0.975096\tvalid_0's ndcg@4: 0.975365\tvalid_0's ndcg@5: 0.975452\n",
- "[27]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.974082\tvalid_0's ndcg@3: 0.975444\tvalid_0's ndcg@4: 0.975713\tvalid_0's ndcg@5: 0.975781\n",
- "[28]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973875\tvalid_0's ndcg@3: 0.975275\tvalid_0's ndcg@4: 0.975512\tvalid_0's ndcg@5: 0.975599\n",
- "[29]\tvalid_0's ndcg@1: 0.935925\tvalid_0's ndcg@2: 0.974159\tvalid_0's ndcg@3: 0.975522\tvalid_0's ndcg@4: 0.975759\tvalid_0's ndcg@5: 0.975836\n",
- "[30]\tvalid_0's ndcg@1: 0.9362\tvalid_0's ndcg@2: 0.974214\tvalid_0's ndcg@3: 0.975589\tvalid_0's ndcg@4: 0.975847\tvalid_0's ndcg@5: 0.975924\n",
- "[31]\tvalid_0's ndcg@1: 0.93625\tvalid_0's ndcg@2: 0.974216\tvalid_0's ndcg@3: 0.975629\tvalid_0's ndcg@4: 0.975876\tvalid_0's ndcg@5: 0.975944\n",
- "[32]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974427\tvalid_0's ndcg@3: 0.975814\tvalid_0's ndcg@4: 0.97603\tvalid_0's ndcg@5: 0.976107\n",
- "[33]\tvalid_0's ndcg@1: 0.936775\tvalid_0's ndcg@2: 0.974505\tvalid_0's ndcg@3: 0.975855\tvalid_0's ndcg@4: 0.976081\tvalid_0's ndcg@5: 0.976158\n",
- "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974643\tvalid_0's ndcg@3: 0.975993\tvalid_0's ndcg@4: 0.976219\tvalid_0's ndcg@5: 0.976296\n",
- "[35]\tvalid_0's ndcg@1: 0.937675\tvalid_0's ndcg@2: 0.974805\tvalid_0's ndcg@3: 0.97618\tvalid_0's ndcg@4: 0.976406\tvalid_0's ndcg@5: 0.976484\n",
- "[36]\tvalid_0's ndcg@1: 0.9382\tvalid_0's ndcg@2: 0.974983\tvalid_0's ndcg@3: 0.976371\tvalid_0's ndcg@4: 0.976597\tvalid_0's ndcg@5: 0.976674\n",
- "[37]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.974974\tvalid_0's ndcg@3: 0.976349\tvalid_0's ndcg@4: 0.976586\tvalid_0's ndcg@5: 0.976663\n",
- "[38]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.975143\tvalid_0's ndcg@3: 0.976518\tvalid_0's ndcg@4: 0.976776\tvalid_0's ndcg@5: 0.976844\n",
- "[39]\tvalid_0's ndcg@1: 0.938575\tvalid_0's ndcg@2: 0.975106\tvalid_0's ndcg@3: 0.976481\tvalid_0's ndcg@4: 0.976739\tvalid_0's ndcg@5: 0.976807\n",
- "[40]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.97519\tvalid_0's ndcg@3: 0.976528\tvalid_0's ndcg@4: 0.976775\tvalid_0's ndcg@5: 0.976853\n",
- "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975347\tvalid_0's ndcg@3: 0.976697\tvalid_0's ndcg@4: 0.976934\tvalid_0's ndcg@5: 0.977001\n",
- "[42]\tvalid_0's ndcg@1: 0.939825\tvalid_0's ndcg@2: 0.975599\tvalid_0's ndcg@3: 0.976961\tvalid_0's ndcg@4: 0.977198\tvalid_0's ndcg@5: 0.977266\n",
- "[43]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975639\tvalid_0's ndcg@3: 0.976977\tvalid_0's ndcg@4: 0.977214\tvalid_0's ndcg@5: 0.977282\n",
- "[44]\tvalid_0's ndcg@1: 0.9398\tvalid_0's ndcg@2: 0.975605\tvalid_0's ndcg@3: 0.976955\tvalid_0's ndcg@4: 0.977192\tvalid_0's ndcg@5: 0.97726\n",
- "[45]\tvalid_0's ndcg@1: 0.9401\tvalid_0's ndcg@2: 0.9757\tvalid_0's ndcg@3: 0.977075\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
- "[46]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975845\tvalid_0's ndcg@3: 0.977183\tvalid_0's ndcg@4: 0.97742\tvalid_0's ndcg@5: 0.977497\n",
- "[47]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975854\tvalid_0's ndcg@3: 0.977204\tvalid_0's ndcg@4: 0.97743\tvalid_0's ndcg@5: 0.977508\n",
- "[48]\tvalid_0's ndcg@1: 0.940575\tvalid_0's ndcg@2: 0.975923\tvalid_0's ndcg@3: 0.977273\tvalid_0's ndcg@4: 0.977488\tvalid_0's ndcg@5: 0.977556\n",
- "[49]\tvalid_0's ndcg@1: 0.9407\tvalid_0's ndcg@2: 0.975922\tvalid_0's ndcg@3: 0.977297\tvalid_0's ndcg@4: 0.977501\tvalid_0's ndcg@5: 0.977588\n",
- "[50]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977322\tvalid_0's ndcg@4: 0.977505\tvalid_0's ndcg@5: 0.977592\n",
- "[51]\tvalid_0's ndcg@1: 0.9406\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.97725\tvalid_0's ndcg@4: 0.977422\tvalid_0's ndcg@5: 0.977509\n",
- "[52]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975997\tvalid_0's ndcg@3: 0.977422\tvalid_0's ndcg@4: 0.977594\tvalid_0's ndcg@5: 0.977691\n",
- "[53]\tvalid_0's ndcg@1: 0.940925\tvalid_0's ndcg@2: 0.975989\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977538\tvalid_0's ndcg@5: 0.977644\n",
- "[54]\tvalid_0's ndcg@1: 0.94125\tvalid_0's ndcg@2: 0.976062\tvalid_0's ndcg@3: 0.977487\tvalid_0's ndcg@4: 0.977659\tvalid_0's ndcg@5: 0.977756\n",
- "[55]\tvalid_0's ndcg@1: 0.94145\tvalid_0's ndcg@2: 0.976183\tvalid_0's ndcg@3: 0.97757\tvalid_0's ndcg@4: 0.977742\tvalid_0's ndcg@5: 0.977839\n",
- "[56]\tvalid_0's ndcg@1: 0.941475\tvalid_0's ndcg@2: 0.976176\tvalid_0's ndcg@3: 0.977576\tvalid_0's ndcg@4: 0.977748\tvalid_0's ndcg@5: 0.977845\n",
- "[57]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976139\tvalid_0's ndcg@3: 0.977539\tvalid_0's ndcg@4: 0.977712\tvalid_0's ndcg@5: 0.977808\n",
- "[58]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.97625\tvalid_0's ndcg@3: 0.97765\tvalid_0's ndcg@4: 0.977822\tvalid_0's ndcg@5: 0.977919\n",
- "[59]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.976253\tvalid_0's ndcg@3: 0.977653\tvalid_0's ndcg@4: 0.977836\tvalid_0's ndcg@5: 0.977932\n",
- "[60]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977634\tvalid_0's ndcg@4: 0.977817\tvalid_0's ndcg@5: 0.977914\n",
- "[61]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.976333\tvalid_0's ndcg@3: 0.977745\tvalid_0's ndcg@4: 0.977918\tvalid_0's ndcg@5: 0.978005\n",
- "[62]\tvalid_0's ndcg@1: 0.941975\tvalid_0's ndcg@2: 0.976345\tvalid_0's ndcg@3: 0.977757\tvalid_0's ndcg@4: 0.97794\tvalid_0's ndcg@5: 0.978027\n",
- "[63]\tvalid_0's ndcg@1: 0.9423\tvalid_0's ndcg@2: 0.976496\tvalid_0's ndcg@3: 0.977871\tvalid_0's ndcg@4: 0.978065\tvalid_0's ndcg@5: 0.978152\n",
- "[64]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976632\tvalid_0's ndcg@3: 0.977995\tvalid_0's ndcg@4: 0.978188\tvalid_0's ndcg@5: 0.978275\n",
- "[65]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976629\tvalid_0's ndcg@3: 0.977979\tvalid_0's ndcg@4: 0.978173\tvalid_0's ndcg@5: 0.97826\n",
- "[66]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976685\tvalid_0's ndcg@3: 0.978035\tvalid_0's ndcg@4: 0.978229\tvalid_0's ndcg@5: 0.978316\n",
- "[67]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976678\tvalid_0's ndcg@3: 0.978041\tvalid_0's ndcg@4: 0.978224\tvalid_0's ndcg@5: 0.97832\n",
- "[68]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976694\tvalid_0's ndcg@3: 0.978044\tvalid_0's ndcg@4: 0.978227\tvalid_0's ndcg@5: 0.978324\n",
- "[69]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976834\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978329\tvalid_0's ndcg@5: 0.978426\n",
- "[70]\tvalid_0's ndcg@1: 0.943025\tvalid_0's ndcg@2: 0.976827\tvalid_0's ndcg@3: 0.978152\tvalid_0's ndcg@4: 0.978324\tvalid_0's ndcg@5: 0.978431\n",
- "[71]\tvalid_0's ndcg@1: 0.9432\tvalid_0's ndcg@2: 0.976923\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978504\n",
- "[72]\tvalid_0's ndcg@1: 0.943225\tvalid_0's ndcg@2: 0.976917\tvalid_0's ndcg@3: 0.978254\tvalid_0's ndcg@4: 0.978405\tvalid_0's ndcg@5: 0.978511\n",
- "[73]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976936\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978409\tvalid_0's ndcg@5: 0.978496\n"
- ]
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:04.332198Z",
+ "start_time": "2020-11-18T04:21:04.325020Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Min-max normalize the ranking scores\n",
+ "def norm_sim(sim_df, weight=0.0):\n",
+ "    # print(sim_df.head())\n",
+ "    min_sim = sim_df.min()\n",
+ "    max_sim = sim_df.max()\n",
+ "    if max_sim == min_sim:\n",
+ "        sim_df = sim_df.apply(lambda sim: 1.0)\n",
+ "    else:\n",
+ "        sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))\n",
+ "\n",
+ "    sim_df = sim_df.apply(lambda sim: sim + weight)  # add the weight offset\n",
+ "    return sim_df"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[74]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976957\tvalid_0's ndcg@3: 0.97827\tvalid_0's ndcg@4: 0.978431\tvalid_0's ndcg@5: 0.978528\n",
- "[75]\tvalid_0's ndcg@1: 0.943075\tvalid_0's ndcg@2: 0.976861\tvalid_0's ndcg@3: 0.978199\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978457\n",
- "[76]\tvalid_0's ndcg@1: 0.94335\tvalid_0's ndcg@2: 0.976963\tvalid_0's ndcg@3: 0.978288\tvalid_0's ndcg@4: 0.978471\tvalid_0's ndcg@5: 0.978568\n",
- "[77]\tvalid_0's ndcg@1: 0.94345\tvalid_0's ndcg@2: 0.977031\tvalid_0's ndcg@3: 0.978331\tvalid_0's ndcg@4: 0.978514\tvalid_0's ndcg@5: 0.978611\n",
- "[78]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.977088\tvalid_0's ndcg@3: 0.97835\tvalid_0's ndcg@4: 0.978533\tvalid_0's ndcg@5: 0.97863\n",
- "[79]\tvalid_0's ndcg@1: 0.943625\tvalid_0's ndcg@2: 0.977096\tvalid_0's ndcg@3: 0.978396\tvalid_0's ndcg@4: 0.978579\tvalid_0's ndcg@5: 0.978676\n",
- "[80]\tvalid_0's ndcg@1: 0.943825\tvalid_0's ndcg@2: 0.977154\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978651\tvalid_0's ndcg@5: 0.978748\n",
- "[81]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.97846\tvalid_0's ndcg@4: 0.978633\tvalid_0's ndcg@5: 0.978729\n",
- "[82]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.977361\tvalid_0's ndcg@3: 0.978673\tvalid_0's ndcg@4: 0.978845\tvalid_0's ndcg@5: 0.978933\n",
- "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977324\tvalid_0's ndcg@3: 0.978624\tvalid_0's ndcg@4: 0.978796\tvalid_0's ndcg@5: 0.978893\n",
- "[84]\tvalid_0's ndcg@1: 0.94405\tvalid_0's ndcg@2: 0.977253\tvalid_0's ndcg@3: 0.978565\tvalid_0's ndcg@4: 0.978737\tvalid_0's ndcg@5: 0.978834\n",
- "[85]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977283\tvalid_0's ndcg@3: 0.978633\tvalid_0's ndcg@4: 0.978795\tvalid_0's ndcg@5: 0.978882\n",
- "[86]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.97745\tvalid_0's ndcg@3: 0.978763\tvalid_0's ndcg@4: 0.978924\tvalid_0's ndcg@5: 0.979011\n",
- "[87]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.977419\tvalid_0's ndcg@3: 0.978756\tvalid_0's ndcg@4: 0.978918\tvalid_0's ndcg@5: 0.979005\n",
- "[88]\tvalid_0's ndcg@1: 0.944825\tvalid_0's ndcg@2: 0.977554\tvalid_0's ndcg@3: 0.978867\tvalid_0's ndcg@4: 0.979039\tvalid_0's ndcg@5: 0.979126\n",
- "[89]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977767\tvalid_0's ndcg@3: 0.979079\tvalid_0's ndcg@4: 0.979262\tvalid_0's ndcg@5: 0.97934\n",
- "[90]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977773\tvalid_0's ndcg@3: 0.979073\tvalid_0's ndcg@4: 0.979256\tvalid_0's ndcg@5: 0.979334\n",
- "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977792\tvalid_0's ndcg@3: 0.979092\tvalid_0's ndcg@4: 0.979275\tvalid_0's ndcg@5: 0.979352\n",
- "[92]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977776\tvalid_0's ndcg@3: 0.979088\tvalid_0's ndcg@4: 0.979261\tvalid_0's ndcg@5: 0.979348\n",
- "[93]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977757\tvalid_0's ndcg@3: 0.979082\tvalid_0's ndcg@4: 0.979244\tvalid_0's ndcg@5: 0.979331\n",
- "[94]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977761\tvalid_0's ndcg@3: 0.979061\tvalid_0's ndcg@4: 0.979223\tvalid_0's ndcg@5: 0.97931\n",
- "[95]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977798\tvalid_0's ndcg@3: 0.979086\tvalid_0's ndcg@4: 0.979258\tvalid_0's ndcg@5: 0.979345\n",
- "[96]\tvalid_0's ndcg@1: 0.945825\tvalid_0's ndcg@2: 0.977955\tvalid_0's ndcg@3: 0.97923\tvalid_0's ndcg@4: 0.979413\tvalid_0's ndcg@5: 0.9795\n",
- "[97]\tvalid_0's ndcg@1: 0.945925\tvalid_0's ndcg@2: 0.97796\tvalid_0's ndcg@3: 0.97926\tvalid_0's ndcg@4: 0.979443\tvalid_0's ndcg@5: 0.979531\n",
- "[98]\tvalid_0's ndcg@1: 0.9464\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.97942\tvalid_0's ndcg@4: 0.979625\tvalid_0's ndcg@5: 0.979702\n",
- "[99]\tvalid_0's ndcg@1: 0.94655\tvalid_0's ndcg@2: 0.978191\tvalid_0's ndcg@3: 0.979479\tvalid_0's ndcg@4: 0.979683\tvalid_0's ndcg@5: 0.97977\n",
- "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
- "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.963031\tvalid_0's ndcg@3: 0.965281\tvalid_0's ndcg@4: 0.965819\tvalid_0's ndcg@5: 0.965887\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9141\tvalid_0's ndcg@2: 0.964748\tvalid_0's ndcg@3: 0.96681\tvalid_0's ndcg@4: 0.967316\tvalid_0's ndcg@5: 0.967394\n",
- "[3]\tvalid_0's ndcg@1: 0.915925\tvalid_0's ndcg@2: 0.9655\tvalid_0's ndcg@3: 0.967575\tvalid_0's ndcg@4: 0.968028\tvalid_0's ndcg@5: 0.968105\n",
- "[4]\tvalid_0's ndcg@1: 0.91915\tvalid_0's ndcg@2: 0.966943\tvalid_0's ndcg@3: 0.968968\tvalid_0's ndcg@4: 0.969334\tvalid_0's ndcg@5: 0.969373\n",
- "[5]\tvalid_0's ndcg@1: 0.920625\tvalid_0's ndcg@2: 0.967598\tvalid_0's ndcg@3: 0.969498\tvalid_0's ndcg@4: 0.969896\tvalid_0's ndcg@5: 0.969944\n",
- "[6]\tvalid_0's ndcg@1: 0.922625\tvalid_0's ndcg@2: 0.968336\tvalid_0's ndcg@3: 0.970261\tvalid_0's ndcg@4: 0.970659\tvalid_0's ndcg@5: 0.970688\n",
- "[7]\tvalid_0's ndcg@1: 0.923625\tvalid_0's ndcg@2: 0.968768\tvalid_0's ndcg@3: 0.970656\tvalid_0's ndcg@4: 0.971043\tvalid_0's ndcg@5: 0.971072\n",
- "[8]\tvalid_0's ndcg@1: 0.925825\tvalid_0's ndcg@2: 0.969612\tvalid_0's ndcg@3: 0.971462\tvalid_0's ndcg@4: 0.97186\tvalid_0's ndcg@5: 0.971879\n",
- "[9]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.969899\tvalid_0's ndcg@3: 0.971711\tvalid_0's ndcg@4: 0.97211\tvalid_0's ndcg@5: 0.972129\n",
- "[10]\tvalid_0's ndcg@1: 0.927775\tvalid_0's ndcg@2: 0.97041\tvalid_0's ndcg@3: 0.972185\tvalid_0's ndcg@4: 0.972594\tvalid_0's ndcg@5: 0.972614\n",
- "[11]\tvalid_0's ndcg@1: 0.92885\tvalid_0's ndcg@2: 0.970838\tvalid_0's ndcg@3: 0.972588\tvalid_0's ndcg@4: 0.973008\tvalid_0's ndcg@5: 0.973028\n",
- "[12]\tvalid_0's ndcg@1: 0.930325\tvalid_0's ndcg@2: 0.971367\tvalid_0's ndcg@3: 0.973129\tvalid_0's ndcg@4: 0.973549\tvalid_0's ndcg@5: 0.973569\n",
- "[13]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971631\tvalid_0's ndcg@3: 0.973443\tvalid_0's ndcg@4: 0.973842\tvalid_0's ndcg@5: 0.973871\n",
- "[14]\tvalid_0's ndcg@1: 0.931525\tvalid_0's ndcg@2: 0.971778\tvalid_0's ndcg@3: 0.973616\tvalid_0's ndcg@4: 0.973993\tvalid_0's ndcg@5: 0.974022\n",
- "[15]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.9717\tvalid_0's ndcg@3: 0.973475\tvalid_0's ndcg@4: 0.973852\tvalid_0's ndcg@5: 0.973872\n",
- "[16]\tvalid_0's ndcg@1: 0.931775\tvalid_0's ndcg@2: 0.971902\tvalid_0's ndcg@3: 0.973702\tvalid_0's ndcg@4: 0.97409\tvalid_0's ndcg@5: 0.974109\n",
- "[17]\tvalid_0's ndcg@1: 0.931425\tvalid_0's ndcg@2: 0.971805\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973967\tvalid_0's ndcg@5: 0.973986\n",
- "[18]\tvalid_0's ndcg@1: 0.931575\tvalid_0's ndcg@2: 0.971876\tvalid_0's ndcg@3: 0.973651\tvalid_0's ndcg@4: 0.974027\tvalid_0's ndcg@5: 0.974047\n",
- "[19]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.97208\tvalid_0's ndcg@3: 0.973805\tvalid_0's ndcg@4: 0.974192\tvalid_0's ndcg@5: 0.974212\n",
- "[20]\tvalid_0's ndcg@1: 0.932075\tvalid_0's ndcg@2: 0.972092\tvalid_0's ndcg@3: 0.973829\tvalid_0's ndcg@4: 0.974217\tvalid_0's ndcg@5: 0.974236\n",
- "[21]\tvalid_0's ndcg@1: 0.932675\tvalid_0's ndcg@2: 0.972282\tvalid_0's ndcg@3: 0.974057\tvalid_0's ndcg@4: 0.974444\tvalid_0's ndcg@5: 0.974454\n",
- "[22]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972358\tvalid_0's ndcg@3: 0.974146\tvalid_0's ndcg@4: 0.974533\tvalid_0's ndcg@5: 0.974543\n",
- "[23]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972478\tvalid_0's ndcg@3: 0.974253\tvalid_0's ndcg@4: 0.974651\tvalid_0's ndcg@5: 0.974661\n",
- "[24]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972539\tvalid_0's ndcg@3: 0.974351\tvalid_0's ndcg@4: 0.974739\tvalid_0's ndcg@5: 0.974749\n",
- "[25]\tvalid_0's ndcg@1: 0.93475\tvalid_0's ndcg@2: 0.973\tvalid_0's ndcg@3: 0.974788\tvalid_0's ndcg@4: 0.975197\tvalid_0's ndcg@5: 0.975206\n",
- "[26]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.97312\tvalid_0's ndcg@3: 0.974895\tvalid_0's ndcg@4: 0.975315\tvalid_0's ndcg@5: 0.975325\n",
- "[27]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.974865\tvalid_0's ndcg@4: 0.975264\tvalid_0's ndcg@5: 0.975273\n",
- "[28]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974939\tvalid_0's ndcg@4: 0.975327\tvalid_0's ndcg@5: 0.975336\n",
- "[29]\tvalid_0's ndcg@1: 0.935475\tvalid_0's ndcg@2: 0.973315\tvalid_0's ndcg@3: 0.975128\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975492\n",
- "[30]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973522\tvalid_0's ndcg@3: 0.975297\tvalid_0's ndcg@4: 0.975663\tvalid_0's ndcg@5: 0.975673\n",
- "[31]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973506\tvalid_0's ndcg@3: 0.975281\tvalid_0's ndcg@4: 0.975658\tvalid_0's ndcg@5: 0.975668\n",
- "[32]\tvalid_0's ndcg@1: 0.93675\tvalid_0's ndcg@2: 0.973833\tvalid_0's ndcg@3: 0.975595\tvalid_0's ndcg@4: 0.975961\tvalid_0's ndcg@5: 0.975971\n",
- "[33]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.973763\tvalid_0's ndcg@3: 0.975488\tvalid_0's ndcg@4: 0.975865\tvalid_0's ndcg@5: 0.975874\n",
- "[34]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973893\tvalid_0's ndcg@3: 0.975568\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975966\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## LGB ranking model"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[35]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974059\tvalid_0's ndcg@3: 0.975722\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.97613\n",
- "[36]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
- "[37]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
- "[38]\tvalid_0's ndcg@1: 0.938725\tvalid_0's ndcg@2: 0.974672\tvalid_0's ndcg@3: 0.97636\tvalid_0's ndcg@4: 0.976715\tvalid_0's ndcg@5: 0.976725\n",
- "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974676\tvalid_0's ndcg@3: 0.976364\tvalid_0's ndcg@4: 0.976697\tvalid_0's ndcg@5: 0.976707\n",
- "[40]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.974867\tvalid_0's ndcg@3: 0.97653\tvalid_0's ndcg@4: 0.976874\tvalid_0's ndcg@5: 0.976884\n",
- "[41]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975042\tvalid_0's ndcg@3: 0.976705\tvalid_0's ndcg@4: 0.97705\tvalid_0's ndcg@5: 0.977059\n",
- "[42]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976784\tvalid_0's ndcg@4: 0.977129\tvalid_0's ndcg@5: 0.977138\n",
- "[43]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.97517\tvalid_0's ndcg@3: 0.97687\tvalid_0's ndcg@4: 0.977215\tvalid_0's ndcg@5: 0.977225\n",
- "[44]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.97534\tvalid_0's ndcg@3: 0.977015\tvalid_0's ndcg@4: 0.97736\tvalid_0's ndcg@5: 0.97737\n",
- "[45]\tvalid_0's ndcg@1: 0.94055\tvalid_0's ndcg@2: 0.975409\tvalid_0's ndcg@3: 0.977059\tvalid_0's ndcg@4: 0.977403\tvalid_0's ndcg@5: 0.977413\n",
- "[46]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975415\tvalid_0's ndcg@3: 0.97704\tvalid_0's ndcg@4: 0.977396\tvalid_0's ndcg@5: 0.977405\n",
- "[47]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975363\tvalid_0's ndcg@3: 0.977013\tvalid_0's ndcg@4: 0.977357\tvalid_0's ndcg@5: 0.977367\n",
- "[48]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975388\tvalid_0's ndcg@3: 0.977025\tvalid_0's ndcg@4: 0.97737\tvalid_0's ndcg@5: 0.977379\n",
- "[49]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975447\tvalid_0's ndcg@3: 0.977097\tvalid_0's ndcg@4: 0.977409\tvalid_0's ndcg@5: 0.977419\n",
- "[50]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975666\tvalid_0's ndcg@3: 0.977303\tvalid_0's ndcg@4: 0.977615\tvalid_0's ndcg@5: 0.977625\n",
- "[51]\tvalid_0's ndcg@1: 0.94135\tvalid_0's ndcg@2: 0.975751\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.97771\tvalid_0's ndcg@5: 0.97772\n",
- "[52]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.975717\tvalid_0's ndcg@3: 0.977355\tvalid_0's ndcg@4: 0.977688\tvalid_0's ndcg@5: 0.977698\n",
- "[53]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.975713\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977699\tvalid_0's ndcg@5: 0.977718\n",
- "[54]\tvalid_0's ndcg@1: 0.94185\tvalid_0's ndcg@2: 0.975857\tvalid_0's ndcg@3: 0.977557\tvalid_0's ndcg@4: 0.977869\tvalid_0's ndcg@5: 0.977889\n",
- "[55]\tvalid_0's ndcg@1: 0.941925\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.9776\tvalid_0's ndcg@4: 0.977891\tvalid_0's ndcg@5: 0.97791\n",
- "[56]\tvalid_0's ndcg@1: 0.942325\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977719\tvalid_0's ndcg@4: 0.978032\tvalid_0's ndcg@5: 0.978051\n",
- "[57]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977772\tvalid_0's ndcg@4: 0.978073\tvalid_0's ndcg@5: 0.978093\n",
- "[58]\tvalid_0's ndcg@1: 0.9425\tvalid_0's ndcg@2: 0.976081\tvalid_0's ndcg@3: 0.977806\tvalid_0's ndcg@4: 0.978108\tvalid_0's ndcg@5: 0.978127\n",
- "[59]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977788\tvalid_0's ndcg@4: 0.978079\tvalid_0's ndcg@5: 0.978098\n",
- "[60]\tvalid_0's ndcg@1: 0.942375\tvalid_0's ndcg@2: 0.976067\tvalid_0's ndcg@3: 0.977779\tvalid_0's ndcg@4: 0.97807\tvalid_0's ndcg@5: 0.978089\n",
- "[61]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976043\tvalid_0's ndcg@3: 0.97773\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.97804\n",
- "[62]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976117\tvalid_0's ndcg@3: 0.977792\tvalid_0's ndcg@4: 0.978093\tvalid_0's ndcg@5: 0.978112\n",
- "[63]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977881\tvalid_0's ndcg@4: 0.978182\tvalid_0's ndcg@5: 0.978201\n",
- "[64]\tvalid_0's ndcg@1: 0.942925\tvalid_0's ndcg@2: 0.976254\tvalid_0's ndcg@3: 0.977966\tvalid_0's ndcg@4: 0.978268\tvalid_0's ndcg@5: 0.978287\n",
- "[65]\tvalid_0's ndcg@1: 0.9431\tvalid_0's ndcg@2: 0.97635\tvalid_0's ndcg@3: 0.978025\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978357\n",
- "[66]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976445\tvalid_0's ndcg@3: 0.978132\tvalid_0's ndcg@4: 0.978445\tvalid_0's ndcg@5: 0.978464\n",
- "[67]\tvalid_0's ndcg@1: 0.943275\tvalid_0's ndcg@2: 0.976399\tvalid_0's ndcg@3: 0.978074\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978416\n",
- "[68]\tvalid_0's ndcg@1: 0.943325\tvalid_0's ndcg@2: 0.976401\tvalid_0's ndcg@3: 0.978089\tvalid_0's ndcg@4: 0.978412\tvalid_0's ndcg@5: 0.978431\n",
- "[69]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976578\tvalid_0's ndcg@3: 0.97819\tvalid_0's ndcg@4: 0.978546\tvalid_0's ndcg@5: 0.978565\n",
- "[70]\tvalid_0's ndcg@1: 0.944025\tvalid_0's ndcg@2: 0.976707\tvalid_0's ndcg@3: 0.97832\tvalid_0's ndcg@4: 0.978675\tvalid_0's ndcg@5: 0.978694\n",
- "[71]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976772\tvalid_0's ndcg@3: 0.978384\tvalid_0's ndcg@4: 0.97874\tvalid_0's ndcg@5: 0.978759\n",
- "[72]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978409\tvalid_0's ndcg@4: 0.978765\tvalid_0's ndcg@5: 0.978784\n",
- "[73]\tvalid_0's ndcg@1: 0.94445\tvalid_0's ndcg@2: 0.976864\tvalid_0's ndcg@3: 0.978464\tvalid_0's ndcg@4: 0.97883\tvalid_0's ndcg@5: 0.978849\n",
- "[74]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
- "[75]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
- "[76]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.97696\tvalid_0's ndcg@3: 0.978535\tvalid_0's ndcg@4: 0.978901\tvalid_0's ndcg@5: 0.978921\n",
- "[77]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
- "[78]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
- "[79]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976907\tvalid_0's ndcg@3: 0.978507\tvalid_0's ndcg@4: 0.978863\tvalid_0's ndcg@5: 0.978882\n",
- "[80]\tvalid_0's ndcg@1: 0.94455\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97851\tvalid_0's ndcg@4: 0.978865\tvalid_0's ndcg@5: 0.978885\n",
- "[81]\tvalid_0's ndcg@1: 0.944725\tvalid_0's ndcg@2: 0.97695\tvalid_0's ndcg@3: 0.978575\tvalid_0's ndcg@4: 0.978919\tvalid_0's ndcg@5: 0.978948\n",
- "[82]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.977103\tvalid_0's ndcg@3: 0.978765\tvalid_0's ndcg@4: 0.97911\tvalid_0's ndcg@5: 0.979129\n",
- "[83]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977066\tvalid_0's ndcg@3: 0.978716\tvalid_0's ndcg@4: 0.979071\tvalid_0's ndcg@5: 0.97909\n",
- "[84]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.97715\tvalid_0's ndcg@3: 0.978775\tvalid_0's ndcg@4: 0.97912\tvalid_0's ndcg@5: 0.979139\n",
- "[85]\tvalid_0's ndcg@1: 0.945025\tvalid_0's ndcg@2: 0.977092\tvalid_0's ndcg@3: 0.978692\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979067\n",
- "[86]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977172\tvalid_0's ndcg@3: 0.97876\tvalid_0's ndcg@4: 0.979115\tvalid_0's ndcg@5: 0.979135\n",
- "[87]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.97879\tvalid_0's ndcg@4: 0.979156\tvalid_0's ndcg@5: 0.979166\n",
- "[88]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.978815\tvalid_0's ndcg@4: 0.979149\tvalid_0's ndcg@5: 0.979168\n",
- "[89]\tvalid_0's ndcg@1: 0.94555\tvalid_0's ndcg@2: 0.977333\tvalid_0's ndcg@3: 0.978933\tvalid_0's ndcg@4: 0.979267\tvalid_0's ndcg@5: 0.979277\n",
- "[90]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977462\tvalid_0's ndcg@3: 0.979062\tvalid_0's ndcg@4: 0.979396\tvalid_0's ndcg@5: 0.979406\n",
- "[91]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977481\tvalid_0's ndcg@3: 0.979081\tvalid_0's ndcg@4: 0.979414\tvalid_0's ndcg@5: 0.979424\n",
- "[92]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977437\tvalid_0's ndcg@3: 0.97905\tvalid_0's ndcg@4: 0.979384\tvalid_0's ndcg@5: 0.979393\n",
- "[93]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977421\tvalid_0's ndcg@3: 0.979046\tvalid_0's ndcg@4: 0.97938\tvalid_0's ndcg@5: 0.97939\n",
- "[94]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977431\tvalid_0's ndcg@3: 0.979068\tvalid_0's ndcg@4: 0.979391\tvalid_0's ndcg@5: 0.979401\n",
- "[95]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977449\tvalid_0's ndcg@3: 0.979074\tvalid_0's ndcg@4: 0.979408\tvalid_0's ndcg@5: 0.979418\n",
- "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979127\tvalid_0's ndcg@4: 0.979461\tvalid_0's ndcg@5: 0.97947\n"
- ]
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:07.787698Z",
+ "start_time": "2020-11-18T04:21:07.536514Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Work on copies so the data need not be reloaded if a later step fails\n",
+ "trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()\n",
+ " \n",
+ "tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[97]\tvalid_0's ndcg@1: 0.946375\tvalid_0's ndcg@2: 0.977622\tvalid_0's ndcg@3: 0.979222\tvalid_0's ndcg@4: 0.979577\tvalid_0's ndcg@5: 0.979577\n",
- "[98]\tvalid_0's ndcg@1: 0.946625\tvalid_0's ndcg@2: 0.977714\tvalid_0's ndcg@3: 0.979339\tvalid_0's ndcg@4: 0.979673\tvalid_0's ndcg@5: 0.979673\n",
- "[99]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.977739\tvalid_0's ndcg@3: 0.979352\tvalid_0's ndcg@4: 0.979685\tvalid_0's ndcg@5: 0.979685\n",
- "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n"
- ]
- }
- ],
- "source": [
-    "# Five-fold cross-validation; the folds are split by user\n",
-    "# This part is independent of the single train/validation run above\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_rank_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id','label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
-    "# Five-fold cross-validation; save the intermediate results for stacking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
-    "    # Per-user groups for the training and validation sets\n",
- " train_idx.sort_values(by=['user_id'], inplace=True)\n",
- " g_train = train_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- " \n",
- " valid_idx.sort_values(by=['user_id'], inplace=True)\n",
- " g_val = valid_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- " \n",
-    "    # Define the model\n",
- " lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) \n",
-    "    # Train the model\n",
- " lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,\n",
- " eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group= [g_val], \n",
- " eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
- " \n",
-    "    # Predict on the validation set\n",
- " valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
- " \n",
-    "    # Normalize the predicted scores\n",
- " valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
-    "    # Collect the validation predictions in a list for later concatenation\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
-    "    # For the online test, sum the predictions from each fold and average them at the end\n",
- " if not offline:\n",
- " sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], lgb_ranker.best_iteration_)\n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
-    "# Save the new features produced by cross-validation on the training set\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)\n",
- " \n",
-    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for later stacking; further features could also be built here\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
-    "# Save the new cross-validation features for the test set\n",
- "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:52.604397Z",
- "start_time": "2020-11-18T04:22:43.253034Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
-    "# Submission from the single model\n",
- "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_ranker')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## LGB Classification Model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:58.259730Z",
- "start_time": "2020-11-18T04:22:58.254297Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Define the model and its parameters\n",
- "lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:11.258774Z",
- "start_time": "2020-11-18T04:23:00.861936Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Train the model\n",
- "if offline:\n",
- " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],\n",
- " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
- " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
- "else:\n",
- " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:19.591396Z",
- "start_time": "2020-11-18T04:23:13.813850Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Predict with the model\n",
- "tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:,1]\n",
- "\n",
-    "# Save a copy of these ranking scores for later model fusion\n",
- "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:32.352931Z",
- "start_time": "2020-11-18T04:23:22.346609Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
- "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_cls')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:11.241196Z",
- "start_time": "2020-11-18T04:23:41.377394Z"
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:10.839656Z",
+ "start_time": "2020-11-18T04:21:10.833109Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the feature columns\n",
+ "lgb_cols = ['sim0', 'time_diff0', 'word_diff0','sim_max', 'sim_min', 'sim_sum', \n",
+ " 'sim_mean', 'score','click_size', 'time_diff_mean', 'active_level',\n",
+ " 'click_environment','click_deviceGroup', 'click_os', 'click_country', \n",
+ " 'click_region','click_referrer_type', 'user_time_hob1', 'user_time_hob2',\n",
+ " 'words_hbo', 'category_id', 'created_at_ts','words_count']"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[1]\tvalid_0's auc: 0.764896\tvalid_0's binary_logloss: 0.522153\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.767857\tvalid_0's binary_logloss: 0.52057\n",
- "[3]\tvalid_0's auc: 0.783096\tvalid_0's binary_logloss: 0.519584\n",
- "[4]\tvalid_0's auc: 0.784354\tvalid_0's binary_logloss: 0.518485\n",
- "[5]\tvalid_0's auc: 0.790554\tvalid_0's binary_logloss: 0.516886\n",
- "[6]\tvalid_0's auc: 0.791954\tvalid_0's binary_logloss: 0.515334\n",
- "[7]\tvalid_0's auc: 0.794257\tvalid_0's binary_logloss: 0.514032\n",
- "[8]\tvalid_0's auc: 0.795222\tvalid_0's binary_logloss: 0.512516\n",
- "[9]\tvalid_0's auc: 0.795417\tvalid_0's binary_logloss: 0.511671\n",
- "[10]\tvalid_0's auc: 0.795913\tvalid_0's binary_logloss: 0.510226\n",
- "[11]\tvalid_0's auc: 0.798222\tvalid_0's binary_logloss: 0.508858\n",
- "[12]\tvalid_0's auc: 0.79825\tvalid_0's binary_logloss: 0.507928\n",
- "[13]\tvalid_0's auc: 0.798842\tvalid_0's binary_logloss: 0.50708\n",
- "[14]\tvalid_0's auc: 0.798935\tvalid_0's binary_logloss: 0.505752\n",
- "[15]\tvalid_0's auc: 0.799543\tvalid_0's binary_logloss: 0.504388\n",
- "[16]\tvalid_0's auc: 0.800844\tvalid_0's binary_logloss: 0.503126\n",
- "[17]\tvalid_0's auc: 0.800855\tvalid_0's binary_logloss: 0.501809\n",
- "[18]\tvalid_0's auc: 0.801653\tvalid_0's binary_logloss: 0.500676\n",
- "[19]\tvalid_0's auc: 0.801518\tvalid_0's binary_logloss: 0.49987\n",
- "[20]\tvalid_0's auc: 0.801662\tvalid_0's binary_logloss: 0.498625\n",
- "[21]\tvalid_0's auc: 0.802093\tvalid_0's binary_logloss: 0.498113\n",
- "[22]\tvalid_0's auc: 0.803071\tvalid_0's binary_logloss: 0.496933\n",
- "[23]\tvalid_0's auc: 0.803222\tvalid_0's binary_logloss: 0.495864\n",
- "[24]\tvalid_0's auc: 0.802927\tvalid_0's binary_logloss: 0.494691\n",
- "[25]\tvalid_0's auc: 0.802581\tvalid_0's binary_logloss: 0.493543\n",
- "[26]\tvalid_0's auc: 0.802965\tvalid_0's binary_logloss: 0.492444\n",
- "[27]\tvalid_0's auc: 0.80298\tvalid_0's binary_logloss: 0.491336\n",
- "[28]\tvalid_0's auc: 0.803226\tvalid_0's binary_logloss: 0.490275\n",
- "[29]\tvalid_0's auc: 0.803436\tvalid_0's binary_logloss: 0.489126\n",
- "[30]\tvalid_0's auc: 0.803796\tvalid_0's binary_logloss: 0.48802\n",
- "[31]\tvalid_0's auc: 0.803601\tvalid_0's binary_logloss: 0.486988\n",
- "[32]\tvalid_0's auc: 0.804416\tvalid_0's binary_logloss: 0.485972\n",
- "[33]\tvalid_0's auc: 0.804529\tvalid_0's binary_logloss: 0.484939\n",
- "[34]\tvalid_0's auc: 0.804534\tvalid_0's binary_logloss: 0.483927\n",
- "[35]\tvalid_0's auc: 0.804819\tvalid_0's binary_logloss: 0.483271\n",
- "[36]\tvalid_0's auc: 0.804774\tvalid_0's binary_logloss: 0.482273\n",
- "[37]\tvalid_0's auc: 0.805237\tvalid_0's binary_logloss: 0.481639\n",
- "[38]\tvalid_0's auc: 0.805546\tvalid_0's binary_logloss: 0.480959\n",
- "[39]\tvalid_0's auc: 0.805598\tvalid_0's binary_logloss: 0.479955\n",
- "[40]\tvalid_0's auc: 0.806011\tvalid_0's binary_logloss: 0.47903\n",
- "[41]\tvalid_0's auc: 0.806664\tvalid_0's binary_logloss: 0.478439\n",
- "[42]\tvalid_0's auc: 0.807021\tvalid_0's binary_logloss: 0.477798\n",
- "[43]\tvalid_0's auc: 0.80726\tvalid_0's binary_logloss: 0.476829\n",
- "[44]\tvalid_0's auc: 0.807157\tvalid_0's binary_logloss: 0.475976\n",
- "[45]\tvalid_0's auc: 0.807788\tvalid_0's binary_logloss: 0.475056\n",
- "[46]\tvalid_0's auc: 0.80805\tvalid_0's binary_logloss: 0.474446\n",
- "[47]\tvalid_0's auc: 0.808097\tvalid_0's binary_logloss: 0.473576\n",
- "[48]\tvalid_0's auc: 0.80815\tvalid_0's binary_logloss: 0.472676\n",
- "[49]\tvalid_0's auc: 0.808304\tvalid_0's binary_logloss: 0.471918\n",
- "[50]\tvalid_0's auc: 0.808749\tvalid_0's binary_logloss: 0.471481\n",
- "[51]\tvalid_0's auc: 0.808972\tvalid_0's binary_logloss: 0.471104\n",
- "[52]\tvalid_0's auc: 0.809326\tvalid_0's binary_logloss: 0.470289\n",
- "[53]\tvalid_0's auc: 0.809472\tvalid_0's binary_logloss: 0.469508\n",
- "[54]\tvalid_0's auc: 0.809505\tvalid_0's binary_logloss: 0.46869\n",
- "[55]\tvalid_0's auc: 0.809594\tvalid_0's binary_logloss: 0.467885\n",
- "[56]\tvalid_0's auc: 0.809847\tvalid_0's binary_logloss: 0.467356\n",
- "[57]\tvalid_0's auc: 0.810262\tvalid_0's binary_logloss: 0.466531\n",
- "[58]\tvalid_0's auc: 0.810407\tvalid_0's binary_logloss: 0.46573\n",
- "[59]\tvalid_0's auc: 0.810618\tvalid_0's binary_logloss: 0.465205\n",
- "[60]\tvalid_0's auc: 0.81066\tvalid_0's binary_logloss: 0.464435\n",
- "[61]\tvalid_0's auc: 0.810638\tvalid_0's binary_logloss: 0.463721\n",
- "[62]\tvalid_0's auc: 0.810658\tvalid_0's binary_logloss: 0.462982\n",
- "[63]\tvalid_0's auc: 0.811106\tvalid_0's binary_logloss: 0.462246\n",
- "[64]\tvalid_0's auc: 0.811313\tvalid_0's binary_logloss: 0.461748\n",
- "[65]\tvalid_0's auc: 0.811351\tvalid_0's binary_logloss: 0.461038\n",
- "[66]\tvalid_0's auc: 0.811433\tvalid_0's binary_logloss: 0.460323\n",
- "[67]\tvalid_0's auc: 0.81158\tvalid_0's binary_logloss: 0.459662\n",
- "[68]\tvalid_0's auc: 0.811561\tvalid_0's binary_logloss: 0.458988\n",
- "[69]\tvalid_0's auc: 0.811748\tvalid_0's binary_logloss: 0.458592\n",
- "[70]\tvalid_0's auc: 0.811919\tvalid_0's binary_logloss: 0.457934\n",
- "[71]\tvalid_0's auc: 0.812073\tvalid_0's binary_logloss: 0.457508\n",
- "[72]\tvalid_0's auc: 0.812273\tvalid_0's binary_logloss: 0.457038\n",
- "[73]\tvalid_0's auc: 0.812561\tvalid_0's binary_logloss: 0.456439\n",
- "[74]\tvalid_0's auc: 0.812633\tvalid_0's binary_logloss: 0.455789\n",
- "[75]\tvalid_0's auc: 0.812757\tvalid_0's binary_logloss: 0.455173\n",
- "[76]\tvalid_0's auc: 0.812923\tvalid_0's binary_logloss: 0.454533\n",
- "[77]\tvalid_0's auc: 0.81295\tvalid_0's binary_logloss: 0.45392\n",
- "[78]\tvalid_0's auc: 0.813073\tvalid_0's binary_logloss: 0.453517\n",
- "[79]\tvalid_0's auc: 0.813202\tvalid_0's binary_logloss: 0.452932\n",
- "[80]\tvalid_0's auc: 0.813611\tvalid_0's binary_logloss: 0.452285\n",
- "[81]\tvalid_0's auc: 0.813769\tvalid_0's binary_logloss: 0.45191\n",
- "[82]\tvalid_0's auc: 0.814468\tvalid_0's binary_logloss: 0.451455\n",
- "[83]\tvalid_0's auc: 0.814656\tvalid_0's binary_logloss: 0.450885\n",
- "[84]\tvalid_0's auc: 0.814755\tvalid_0's binary_logloss: 0.450308\n",
- "[85]\tvalid_0's auc: 0.814824\tvalid_0's binary_logloss: 0.449739\n",
- "[86]\tvalid_0's auc: 0.81499\tvalid_0's binary_logloss: 0.449348\n",
- "[87]\tvalid_0's auc: 0.815232\tvalid_0's binary_logloss: 0.448759\n",
- "[88]\tvalid_0's auc: 0.815452\tvalid_0's binary_logloss: 0.44823\n",
- "[89]\tvalid_0's auc: 0.815593\tvalid_0's binary_logloss: 0.447861\n",
- "[90]\tvalid_0's auc: 0.815591\tvalid_0's binary_logloss: 0.447323\n",
- "[91]\tvalid_0's auc: 0.815672\tvalid_0's binary_logloss: 0.446796\n",
- "[92]\tvalid_0's auc: 0.815875\tvalid_0's binary_logloss: 0.446472\n",
- "[93]\tvalid_0's auc: 0.815984\tvalid_0's binary_logloss: 0.445961\n",
- "[94]\tvalid_0's auc: 0.816026\tvalid_0's binary_logloss: 0.445439\n",
- "[95]\tvalid_0's auc: 0.816172\tvalid_0's binary_logloss: 0.444909\n",
- "[96]\tvalid_0's auc: 0.816321\tvalid_0's binary_logloss: 0.444413\n",
- "[97]\tvalid_0's auc: 0.816751\tvalid_0's binary_logloss: 0.44405\n",
- "[98]\tvalid_0's auc: 0.817226\tvalid_0's binary_logloss: 0.443626\n",
- "[99]\tvalid_0's auc: 0.817286\tvalid_0's binary_logloss: 0.443136\n",
- "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
- "[1]\tvalid_0's auc: 0.771584\tvalid_0's binary_logloss: 0.527139\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.775446\tvalid_0's binary_logloss: 0.525462\n",
- "[3]\tvalid_0's auc: 0.790092\tvalid_0's binary_logloss: 0.524461\n",
- "[4]\tvalid_0's auc: 0.791432\tvalid_0's binary_logloss: 0.523322\n",
- "[5]\tvalid_0's auc: 0.797482\tvalid_0's binary_logloss: 0.521614\n",
- "[6]\tvalid_0's auc: 0.79893\tvalid_0's binary_logloss: 0.520007\n",
- "[7]\tvalid_0's auc: 0.800753\tvalid_0's binary_logloss: 0.5187\n",
- "[8]\tvalid_0's auc: 0.802197\tvalid_0's binary_logloss: 0.517125\n",
- "[9]\tvalid_0's auc: 0.802828\tvalid_0's binary_logloss: 0.516269\n",
- "[10]\tvalid_0's auc: 0.803496\tvalid_0's binary_logloss: 0.51474\n",
- "[11]\tvalid_0's auc: 0.804972\tvalid_0's binary_logloss: 0.513321\n",
- "[12]\tvalid_0's auc: 0.804995\tvalid_0's binary_logloss: 0.512334\n",
- "[13]\tvalid_0's auc: 0.80525\tvalid_0's binary_logloss: 0.51151\n",
- "[14]\tvalid_0's auc: 0.805026\tvalid_0's binary_logloss: 0.510149\n",
- "[15]\tvalid_0's auc: 0.805622\tvalid_0's binary_logloss: 0.508708\n",
- "[16]\tvalid_0's auc: 0.806974\tvalid_0's binary_logloss: 0.507384\n",
- "[17]\tvalid_0's auc: 0.807045\tvalid_0's binary_logloss: 0.506017\n",
- "[18]\tvalid_0's auc: 0.807265\tvalid_0's binary_logloss: 0.504853\n",
- "[19]\tvalid_0's auc: 0.807126\tvalid_0's binary_logloss: 0.503972\n",
- "[20]\tvalid_0's auc: 0.806948\tvalid_0's binary_logloss: 0.502693\n",
- "[21]\tvalid_0's auc: 0.807315\tvalid_0's binary_logloss: 0.502166\n",
- "[22]\tvalid_0's auc: 0.808067\tvalid_0's binary_logloss: 0.500948\n",
- "[23]\tvalid_0's auc: 0.808226\tvalid_0's binary_logloss: 0.49987\n",
- "[24]\tvalid_0's auc: 0.808268\tvalid_0's binary_logloss: 0.498623\n",
- "[25]\tvalid_0's auc: 0.808569\tvalid_0's binary_logloss: 0.497389\n",
- "[26]\tvalid_0's auc: 0.809069\tvalid_0's binary_logloss: 0.49624\n",
- "[27]\tvalid_0's auc: 0.809312\tvalid_0's binary_logloss: 0.495095\n",
- "[28]\tvalid_0's auc: 0.809549\tvalid_0's binary_logloss: 0.494012\n",
- "[29]\tvalid_0's auc: 0.809944\tvalid_0's binary_logloss: 0.492834\n",
- "[30]\tvalid_0's auc: 0.810047\tvalid_0's binary_logloss: 0.491735\n",
- "[31]\tvalid_0's auc: 0.810086\tvalid_0's binary_logloss: 0.490633\n"
- ]
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:14.126608Z",
+ "start_time": "2020-11-18T04:21:13.493653Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Build the per-user groups for the ranking model\n",
+ "trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
+ "g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
+ " g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[32]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.489595\n",
- "[33]\tvalid_0's auc: 0.810539\tvalid_0's binary_logloss: 0.488536\n",
- "[34]\tvalid_0's auc: 0.810529\tvalid_0's binary_logloss: 0.487489\n",
- "[35]\tvalid_0's auc: 0.810932\tvalid_0's binary_logloss: 0.486775\n",
- "[36]\tvalid_0's auc: 0.810769\tvalid_0's binary_logloss: 0.48577\n",
- "[37]\tvalid_0's auc: 0.811363\tvalid_0's binary_logloss: 0.485123\n",
- "[38]\tvalid_0's auc: 0.811801\tvalid_0's binary_logloss: 0.484413\n",
- "[39]\tvalid_0's auc: 0.811987\tvalid_0's binary_logloss: 0.483371\n",
- "[40]\tvalid_0's auc: 0.812268\tvalid_0's binary_logloss: 0.482407\n",
- "[41]\tvalid_0's auc: 0.813297\tvalid_0's binary_logloss: 0.481742\n",
- "[42]\tvalid_0's auc: 0.813453\tvalid_0's binary_logloss: 0.481108\n",
- "[43]\tvalid_0's auc: 0.813603\tvalid_0's binary_logloss: 0.480163\n",
- "[44]\tvalid_0's auc: 0.813654\tvalid_0's binary_logloss: 0.479239\n",
- "[45]\tvalid_0's auc: 0.814267\tvalid_0's binary_logloss: 0.478299\n",
- "[46]\tvalid_0's auc: 0.81455\tvalid_0's binary_logloss: 0.477678\n",
- "[47]\tvalid_0's auc: 0.81452\tvalid_0's binary_logloss: 0.476766\n",
- "[48]\tvalid_0's auc: 0.814925\tvalid_0's binary_logloss: 0.475815\n",
- "[49]\tvalid_0's auc: 0.814907\tvalid_0's binary_logloss: 0.47503\n",
- "[50]\tvalid_0's auc: 0.815278\tvalid_0's binary_logloss: 0.474588\n",
- "[51]\tvalid_0's auc: 0.815535\tvalid_0's binary_logloss: 0.474171\n",
- "[52]\tvalid_0's auc: 0.815685\tvalid_0's binary_logloss: 0.473335\n",
- "[53]\tvalid_0's auc: 0.815787\tvalid_0's binary_logloss: 0.472509\n",
- "[54]\tvalid_0's auc: 0.815827\tvalid_0's binary_logloss: 0.471686\n",
- "[55]\tvalid_0's auc: 0.815871\tvalid_0's binary_logloss: 0.470838\n",
- "[56]\tvalid_0's auc: 0.816238\tvalid_0's binary_logloss: 0.470285\n",
- "[57]\tvalid_0's auc: 0.816269\tvalid_0's binary_logloss: 0.469495\n",
- "[58]\tvalid_0's auc: 0.816528\tvalid_0's binary_logloss: 0.468654\n",
- "[59]\tvalid_0's auc: 0.816706\tvalid_0's binary_logloss: 0.468122\n",
- "[60]\tvalid_0's auc: 0.816821\tvalid_0's binary_logloss: 0.467352\n",
- "[61]\tvalid_0's auc: 0.816759\tvalid_0's binary_logloss: 0.466622\n",
- "[62]\tvalid_0's auc: 0.81682\tvalid_0's binary_logloss: 0.465867\n",
- "[63]\tvalid_0's auc: 0.817251\tvalid_0's binary_logloss: 0.465112\n",
- "[64]\tvalid_0's auc: 0.817476\tvalid_0's binary_logloss: 0.464589\n",
- "[65]\tvalid_0's auc: 0.817613\tvalid_0's binary_logloss: 0.463831\n",
- "[66]\tvalid_0's auc: 0.817648\tvalid_0's binary_logloss: 0.463098\n",
- "[67]\tvalid_0's auc: 0.817719\tvalid_0's binary_logloss: 0.462414\n",
- "[68]\tvalid_0's auc: 0.817814\tvalid_0's binary_logloss: 0.461727\n",
- "[69]\tvalid_0's auc: 0.817973\tvalid_0's binary_logloss: 0.461329\n",
- "[70]\tvalid_0's auc: 0.818108\tvalid_0's binary_logloss: 0.460674\n",
- "[71]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.460222\n",
- "[72]\tvalid_0's auc: 0.818456\tvalid_0's binary_logloss: 0.45977\n",
- "[73]\tvalid_0's auc: 0.818727\tvalid_0's binary_logloss: 0.459157\n",
- "[74]\tvalid_0's auc: 0.818988\tvalid_0's binary_logloss: 0.458437\n",
- "[75]\tvalid_0's auc: 0.819144\tvalid_0's binary_logloss: 0.457808\n",
- "[76]\tvalid_0's auc: 0.819259\tvalid_0's binary_logloss: 0.457159\n",
- "[77]\tvalid_0's auc: 0.819343\tvalid_0's binary_logloss: 0.456512\n",
- "[78]\tvalid_0's auc: 0.81954\tvalid_0's binary_logloss: 0.456045\n",
- "[79]\tvalid_0's auc: 0.819687\tvalid_0's binary_logloss: 0.455416\n",
- "[80]\tvalid_0's auc: 0.819958\tvalid_0's binary_logloss: 0.454765\n",
- "[81]\tvalid_0's auc: 0.820115\tvalid_0's binary_logloss: 0.45436\n",
- "[82]\tvalid_0's auc: 0.820536\tvalid_0's binary_logloss: 0.453965\n",
- "[83]\tvalid_0's auc: 0.820649\tvalid_0's binary_logloss: 0.453383\n",
- "[84]\tvalid_0's auc: 0.820663\tvalid_0's binary_logloss: 0.452804\n",
- "[85]\tvalid_0's auc: 0.820809\tvalid_0's binary_logloss: 0.452167\n",
- "[86]\tvalid_0's auc: 0.821024\tvalid_0's binary_logloss: 0.451735\n",
- "[87]\tvalid_0's auc: 0.821124\tvalid_0's binary_logloss: 0.451167\n",
- "[88]\tvalid_0's auc: 0.821243\tvalid_0's binary_logloss: 0.45061\n",
- "[89]\tvalid_0's auc: 0.821404\tvalid_0's binary_logloss: 0.450215\n",
- "[90]\tvalid_0's auc: 0.821488\tvalid_0's binary_logloss: 0.449656\n",
- "[91]\tvalid_0's auc: 0.821538\tvalid_0's binary_logloss: 0.449107\n",
- "[92]\tvalid_0's auc: 0.82172\tvalid_0's binary_logloss: 0.448752\n",
- "[93]\tvalid_0's auc: 0.821809\tvalid_0's binary_logloss: 0.448188\n",
- "[94]\tvalid_0's auc: 0.82184\tvalid_0's binary_logloss: 0.447659\n",
- "[95]\tvalid_0's auc: 0.821971\tvalid_0's binary_logloss: 0.447108\n",
- "[96]\tvalid_0's auc: 0.822086\tvalid_0's binary_logloss: 0.446596\n",
- "[97]\tvalid_0's auc: 0.82247\tvalid_0's binary_logloss: 0.446244\n",
- "[98]\tvalid_0's auc: 0.822951\tvalid_0's binary_logloss: 0.445812\n",
- "[99]\tvalid_0's auc: 0.822991\tvalid_0's binary_logloss: 0.445329\n",
- "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
- "[1]\tvalid_0's auc: 0.769525\tvalid_0's binary_logloss: 0.526256\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.775857\tvalid_0's binary_logloss: 0.524594\n",
- "[3]\tvalid_0's auc: 0.785307\tvalid_0's binary_logloss: 0.523606\n",
- "[4]\tvalid_0's auc: 0.786356\tvalid_0's binary_logloss: 0.522495\n",
- "[5]\tvalid_0's auc: 0.793385\tvalid_0's binary_logloss: 0.520812\n",
- "[6]\tvalid_0's auc: 0.794014\tvalid_0's binary_logloss: 0.519253\n",
- "[7]\tvalid_0's auc: 0.795454\tvalid_0's binary_logloss: 0.517961\n",
- "[8]\tvalid_0's auc: 0.79807\tvalid_0's binary_logloss: 0.516363\n",
- "[9]\tvalid_0's auc: 0.798756\tvalid_0's binary_logloss: 0.51548\n",
- "[10]\tvalid_0's auc: 0.798314\tvalid_0's binary_logloss: 0.514021\n",
- "[11]\tvalid_0's auc: 0.799343\tvalid_0's binary_logloss: 0.512678\n",
- "[12]\tvalid_0's auc: 0.799573\tvalid_0's binary_logloss: 0.511708\n",
- "[13]\tvalid_0's auc: 0.799563\tvalid_0's binary_logloss: 0.510892\n",
- "[14]\tvalid_0's auc: 0.800333\tvalid_0's binary_logloss: 0.509532\n",
- "[15]\tvalid_0's auc: 0.800672\tvalid_0's binary_logloss: 0.508117\n",
- "[16]\tvalid_0's auc: 0.801953\tvalid_0's binary_logloss: 0.506866\n",
- "[17]\tvalid_0's auc: 0.802078\tvalid_0's binary_logloss: 0.5055\n",
- "[18]\tvalid_0's auc: 0.802449\tvalid_0's binary_logloss: 0.504358\n",
- "[19]\tvalid_0's auc: 0.802329\tvalid_0's binary_logloss: 0.503503\n",
- "[20]\tvalid_0's auc: 0.802437\tvalid_0's binary_logloss: 0.502233\n",
- "[21]\tvalid_0's auc: 0.802653\tvalid_0's binary_logloss: 0.50174\n",
- "[22]\tvalid_0's auc: 0.803753\tvalid_0's binary_logloss: 0.50056\n",
- "[23]\tvalid_0's auc: 0.803956\tvalid_0's binary_logloss: 0.499496\n",
- "[24]\tvalid_0's auc: 0.804231\tvalid_0's binary_logloss: 0.498283\n",
- "[25]\tvalid_0's auc: 0.804554\tvalid_0's binary_logloss: 0.497059\n",
- "[26]\tvalid_0's auc: 0.805133\tvalid_0's binary_logloss: 0.495963\n",
- "[27]\tvalid_0's auc: 0.805333\tvalid_0's binary_logloss: 0.494842\n",
- "[28]\tvalid_0's auc: 0.805644\tvalid_0's binary_logloss: 0.493771\n",
- "[29]\tvalid_0's auc: 0.806029\tvalid_0's binary_logloss: 0.492598\n",
- "[30]\tvalid_0's auc: 0.806321\tvalid_0's binary_logloss: 0.491474\n",
- "[31]\tvalid_0's auc: 0.806201\tvalid_0's binary_logloss: 0.490419\n",
- "[32]\tvalid_0's auc: 0.806671\tvalid_0's binary_logloss: 0.489393\n",
- "[33]\tvalid_0's auc: 0.806899\tvalid_0's binary_logloss: 0.488331\n",
- "[34]\tvalid_0's auc: 0.807105\tvalid_0's binary_logloss: 0.487277\n",
- "[35]\tvalid_0's auc: 0.807257\tvalid_0's binary_logloss: 0.486592\n",
- "[36]\tvalid_0's auc: 0.80729\tvalid_0's binary_logloss: 0.485607\n",
- "[37]\tvalid_0's auc: 0.807752\tvalid_0's binary_logloss: 0.484951\n",
- "[38]\tvalid_0's auc: 0.808191\tvalid_0's binary_logloss: 0.484269\n",
- "[39]\tvalid_0's auc: 0.808417\tvalid_0's binary_logloss: 0.483242\n",
- "[40]\tvalid_0's auc: 0.808761\tvalid_0's binary_logloss: 0.482291\n",
- "[41]\tvalid_0's auc: 0.80965\tvalid_0's binary_logloss: 0.48164\n",
- "[42]\tvalid_0's auc: 0.810065\tvalid_0's binary_logloss: 0.480962\n",
- "[43]\tvalid_0's auc: 0.810209\tvalid_0's binary_logloss: 0.479995\n",
- "[44]\tvalid_0's auc: 0.810091\tvalid_0's binary_logloss: 0.479077\n",
- "[45]\tvalid_0's auc: 0.810573\tvalid_0's binary_logloss: 0.478185\n",
- "[46]\tvalid_0's auc: 0.810924\tvalid_0's binary_logloss: 0.477558\n",
- "[47]\tvalid_0's auc: 0.810951\tvalid_0's binary_logloss: 0.476662\n",
- "[48]\tvalid_0's auc: 0.811101\tvalid_0's binary_logloss: 0.475745\n",
- "[49]\tvalid_0's auc: 0.811269\tvalid_0's binary_logloss: 0.474951\n",
- "[50]\tvalid_0's auc: 0.81173\tvalid_0's binary_logloss: 0.474514\n",
- "[51]\tvalid_0's auc: 0.811937\tvalid_0's binary_logloss: 0.474114\n",
- "[52]\tvalid_0's auc: 0.812136\tvalid_0's binary_logloss: 0.473297\n",
- "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.472497\n",
- "[54]\tvalid_0's auc: 0.812121\tvalid_0's binary_logloss: 0.471696\n",
- "[55]\tvalid_0's auc: 0.812164\tvalid_0's binary_logloss: 0.470905\n",
- "[56]\tvalid_0's auc: 0.812462\tvalid_0's binary_logloss: 0.470384\n",
- "[57]\tvalid_0's auc: 0.812613\tvalid_0's binary_logloss: 0.4696\n",
- "[58]\tvalid_0's auc: 0.812615\tvalid_0's binary_logloss: 0.468778\n",
- "[59]\tvalid_0's auc: 0.812842\tvalid_0's binary_logloss: 0.468211\n",
- "[60]\tvalid_0's auc: 0.81312\tvalid_0's binary_logloss: 0.467385\n",
- "[61]\tvalid_0's auc: 0.813039\tvalid_0's binary_logloss: 0.466632\n",
- "[62]\tvalid_0's auc: 0.812942\tvalid_0's binary_logloss: 0.465933\n",
- "[63]\tvalid_0's auc: 0.813274\tvalid_0's binary_logloss: 0.465214\n",
- "[64]\tvalid_0's auc: 0.813572\tvalid_0's binary_logloss: 0.464692\n",
- "[65]\tvalid_0's auc: 0.813594\tvalid_0's binary_logloss: 0.463925\n",
- "[66]\tvalid_0's auc: 0.813719\tvalid_0's binary_logloss: 0.463177\n",
- "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.462513\n",
- "[68]\tvalid_0's auc: 0.813989\tvalid_0's binary_logloss: 0.461843\n"
- ]
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:16.136151Z",
+ "start_time": "2020-11-18T04:21:16.124444Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the ranking model\n",
+ "lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) "
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[69]\tvalid_0's auc: 0.814218\tvalid_0's binary_logloss: 0.461443\n",
- "[70]\tvalid_0's auc: 0.814334\tvalid_0's binary_logloss: 0.460775\n",
- "[71]\tvalid_0's auc: 0.814493\tvalid_0's binary_logloss: 0.460332\n",
- "[72]\tvalid_0's auc: 0.814663\tvalid_0's binary_logloss: 0.459867\n",
- "[73]\tvalid_0's auc: 0.814856\tvalid_0's binary_logloss: 0.459266\n",
- "[74]\tvalid_0's auc: 0.815017\tvalid_0's binary_logloss: 0.458585\n",
- "[75]\tvalid_0's auc: 0.815186\tvalid_0's binary_logloss: 0.457958\n",
- "[76]\tvalid_0's auc: 0.815374\tvalid_0's binary_logloss: 0.457316\n",
- "[77]\tvalid_0's auc: 0.81554\tvalid_0's binary_logloss: 0.45665\n",
- "[78]\tvalid_0's auc: 0.81569\tvalid_0's binary_logloss: 0.456217\n",
- "[79]\tvalid_0's auc: 0.815861\tvalid_0's binary_logloss: 0.455615\n",
- "[80]\tvalid_0's auc: 0.816443\tvalid_0's binary_logloss: 0.454895\n",
- "[81]\tvalid_0's auc: 0.816659\tvalid_0's binary_logloss: 0.454503\n",
- "[82]\tvalid_0's auc: 0.817017\tvalid_0's binary_logloss: 0.454149\n",
- "[83]\tvalid_0's auc: 0.817162\tvalid_0's binary_logloss: 0.453578\n",
- "[84]\tvalid_0's auc: 0.817274\tvalid_0's binary_logloss: 0.452984\n",
- "[85]\tvalid_0's auc: 0.817283\tvalid_0's binary_logloss: 0.452416\n",
- "[86]\tvalid_0's auc: 0.817339\tvalid_0's binary_logloss: 0.452022\n",
- "[87]\tvalid_0's auc: 0.817494\tvalid_0's binary_logloss: 0.45146\n",
- "[88]\tvalid_0's auc: 0.817594\tvalid_0's binary_logloss: 0.450926\n",
- "[89]\tvalid_0's auc: 0.817771\tvalid_0's binary_logloss: 0.450553\n",
- "[90]\tvalid_0's auc: 0.81789\tvalid_0's binary_logloss: 0.449985\n",
- "[91]\tvalid_0's auc: 0.817931\tvalid_0's binary_logloss: 0.449439\n",
- "[92]\tvalid_0's auc: 0.818138\tvalid_0's binary_logloss: 0.449094\n",
- "[93]\tvalid_0's auc: 0.818334\tvalid_0's binary_logloss: 0.448527\n",
- "[94]\tvalid_0's auc: 0.818426\tvalid_0's binary_logloss: 0.447989\n",
- "[95]\tvalid_0's auc: 0.818676\tvalid_0's binary_logloss: 0.447407\n",
- "[96]\tvalid_0's auc: 0.818852\tvalid_0's binary_logloss: 0.446884\n",
- "[97]\tvalid_0's auc: 0.81945\tvalid_0's binary_logloss: 0.446455\n",
- "[98]\tvalid_0's auc: 0.819861\tvalid_0's binary_logloss: 0.446045\n",
- "[99]\tvalid_0's auc: 0.819943\tvalid_0's binary_logloss: 0.445543\n",
- "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
- "[1]\tvalid_0's auc: 0.770032\tvalid_0's binary_logloss: 0.527241\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.779881\tvalid_0's binary_logloss: 0.525545\n",
- "[3]\tvalid_0's auc: 0.791308\tvalid_0's binary_logloss: 0.524508\n",
- "[4]\tvalid_0's auc: 0.790788\tvalid_0's binary_logloss: 0.52341\n",
- "[5]\tvalid_0's auc: 0.795645\tvalid_0's binary_logloss: 0.521753\n",
- "[6]\tvalid_0's auc: 0.797745\tvalid_0's binary_logloss: 0.520131\n",
- "[7]\tvalid_0's auc: 0.79931\tvalid_0's binary_logloss: 0.518872\n",
- "[8]\tvalid_0's auc: 0.800014\tvalid_0's binary_logloss: 0.517353\n",
- "[9]\tvalid_0's auc: 0.800549\tvalid_0's binary_logloss: 0.516487\n",
- "[10]\tvalid_0's auc: 0.800261\tvalid_0's binary_logloss: 0.515039\n",
- "[11]\tvalid_0's auc: 0.801261\tvalid_0's binary_logloss: 0.513695\n",
- "[12]\tvalid_0's auc: 0.801062\tvalid_0's binary_logloss: 0.512735\n",
- "[13]\tvalid_0's auc: 0.801155\tvalid_0's binary_logloss: 0.51192\n",
- "[14]\tvalid_0's auc: 0.801315\tvalid_0's binary_logloss: 0.510559\n",
- "[15]\tvalid_0's auc: 0.80185\tvalid_0's binary_logloss: 0.509147\n",
- "[16]\tvalid_0's auc: 0.803029\tvalid_0's binary_logloss: 0.507914\n",
- "[17]\tvalid_0's auc: 0.803035\tvalid_0's binary_logloss: 0.506583\n",
- "[18]\tvalid_0's auc: 0.803433\tvalid_0's binary_logloss: 0.505441\n",
- "[19]\tvalid_0's auc: 0.803717\tvalid_0's binary_logloss: 0.504599\n",
- "[20]\tvalid_0's auc: 0.803819\tvalid_0's binary_logloss: 0.503327\n",
- "[21]\tvalid_0's auc: 0.803923\tvalid_0's binary_logloss: 0.502782\n",
- "[22]\tvalid_0's auc: 0.804939\tvalid_0's binary_logloss: 0.501596\n",
- "[23]\tvalid_0's auc: 0.804707\tvalid_0's binary_logloss: 0.500572\n",
- "[24]\tvalid_0's auc: 0.804632\tvalid_0's binary_logloss: 0.499367\n",
- "[25]\tvalid_0's auc: 0.804756\tvalid_0's binary_logloss: 0.498161\n",
- "[26]\tvalid_0's auc: 0.805067\tvalid_0's binary_logloss: 0.497061\n",
- "[27]\tvalid_0's auc: 0.805119\tvalid_0's binary_logloss: 0.495933\n",
- "[28]\tvalid_0's auc: 0.805304\tvalid_0's binary_logloss: 0.494849\n",
- "[29]\tvalid_0's auc: 0.805688\tvalid_0's binary_logloss: 0.493677\n",
- "[30]\tvalid_0's auc: 0.805822\tvalid_0's binary_logloss: 0.492594\n",
- "[31]\tvalid_0's auc: 0.805869\tvalid_0's binary_logloss: 0.49152\n",
- "[32]\tvalid_0's auc: 0.807267\tvalid_0's binary_logloss: 0.490435\n",
- "[33]\tvalid_0's auc: 0.807301\tvalid_0's binary_logloss: 0.489392\n",
- "[34]\tvalid_0's auc: 0.80736\tvalid_0's binary_logloss: 0.488325\n",
- "[35]\tvalid_0's auc: 0.807706\tvalid_0's binary_logloss: 0.487654\n",
- "[36]\tvalid_0's auc: 0.807758\tvalid_0's binary_logloss: 0.486651\n",
- "[37]\tvalid_0's auc: 0.808051\tvalid_0's binary_logloss: 0.486012\n",
- "[38]\tvalid_0's auc: 0.808429\tvalid_0's binary_logloss: 0.485355\n",
- "[39]\tvalid_0's auc: 0.808663\tvalid_0's binary_logloss: 0.484327\n",
- "[40]\tvalid_0's auc: 0.809007\tvalid_0's binary_logloss: 0.483386\n",
- "[41]\tvalid_0's auc: 0.809781\tvalid_0's binary_logloss: 0.482745\n",
- "[42]\tvalid_0's auc: 0.810071\tvalid_0's binary_logloss: 0.482124\n",
- "[43]\tvalid_0's auc: 0.810383\tvalid_0's binary_logloss: 0.481154\n",
- "[44]\tvalid_0's auc: 0.810446\tvalid_0's binary_logloss: 0.480243\n",
- "[45]\tvalid_0's auc: 0.811148\tvalid_0's binary_logloss: 0.479261\n",
- "[46]\tvalid_0's auc: 0.811245\tvalid_0's binary_logloss: 0.478687\n",
- "[47]\tvalid_0's auc: 0.811214\tvalid_0's binary_logloss: 0.477812\n",
- "[48]\tvalid_0's auc: 0.811408\tvalid_0's binary_logloss: 0.47689\n",
- "[49]\tvalid_0's auc: 0.811486\tvalid_0's binary_logloss: 0.476132\n",
- "[50]\tvalid_0's auc: 0.811806\tvalid_0's binary_logloss: 0.475718\n",
- "[51]\tvalid_0's auc: 0.812017\tvalid_0's binary_logloss: 0.475342\n",
- "[52]\tvalid_0's auc: 0.812255\tvalid_0's binary_logloss: 0.474505\n",
- "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.473707\n",
- "[54]\tvalid_0's auc: 0.812235\tvalid_0's binary_logloss: 0.47289\n",
- "[55]\tvalid_0's auc: 0.812233\tvalid_0's binary_logloss: 0.472091\n",
- "[56]\tvalid_0's auc: 0.812492\tvalid_0's binary_logloss: 0.471563\n",
- "[57]\tvalid_0's auc: 0.812579\tvalid_0's binary_logloss: 0.47077\n",
- "[58]\tvalid_0's auc: 0.812598\tvalid_0's binary_logloss: 0.469992\n",
- "[59]\tvalid_0's auc: 0.812885\tvalid_0's binary_logloss: 0.469458\n",
- "[60]\tvalid_0's auc: 0.812995\tvalid_0's binary_logloss: 0.468676\n",
- "[61]\tvalid_0's auc: 0.812961\tvalid_0's binary_logloss: 0.467939\n",
- "[62]\tvalid_0's auc: 0.812919\tvalid_0's binary_logloss: 0.467232\n",
- "[63]\tvalid_0's auc: 0.813291\tvalid_0's binary_logloss: 0.466491\n",
- "[64]\tvalid_0's auc: 0.813702\tvalid_0's binary_logloss: 0.465945\n",
- "[65]\tvalid_0's auc: 0.813803\tvalid_0's binary_logloss: 0.465197\n",
- "[66]\tvalid_0's auc: 0.813851\tvalid_0's binary_logloss: 0.4645\n",
- "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.463814\n",
- "[68]\tvalid_0's auc: 0.814027\tvalid_0's binary_logloss: 0.463113\n",
- "[69]\tvalid_0's auc: 0.814138\tvalid_0's binary_logloss: 0.462727\n",
- "[70]\tvalid_0's auc: 0.814365\tvalid_0's binary_logloss: 0.462077\n",
- "[71]\tvalid_0's auc: 0.814432\tvalid_0's binary_logloss: 0.461655\n",
- "[72]\tvalid_0's auc: 0.8146\tvalid_0's binary_logloss: 0.461194\n",
- "[73]\tvalid_0's auc: 0.815324\tvalid_0's binary_logloss: 0.460477\n",
- "[74]\tvalid_0's auc: 0.815411\tvalid_0's binary_logloss: 0.459805\n",
- "[75]\tvalid_0's auc: 0.815548\tvalid_0's binary_logloss: 0.459189\n",
- "[76]\tvalid_0's auc: 0.815625\tvalid_0's binary_logloss: 0.458525\n",
- "[77]\tvalid_0's auc: 0.81562\tvalid_0's binary_logloss: 0.457905\n",
- "[78]\tvalid_0's auc: 0.815786\tvalid_0's binary_logloss: 0.45747\n",
- "[79]\tvalid_0's auc: 0.815834\tvalid_0's binary_logloss: 0.456884\n",
- "[80]\tvalid_0's auc: 0.816475\tvalid_0's binary_logloss: 0.45617\n",
- "[81]\tvalid_0's auc: 0.816677\tvalid_0's binary_logloss: 0.455787\n",
- "[82]\tvalid_0's auc: 0.817255\tvalid_0's binary_logloss: 0.455358\n",
- "[83]\tvalid_0's auc: 0.817383\tvalid_0's binary_logloss: 0.454775\n",
- "[84]\tvalid_0's auc: 0.817509\tvalid_0's binary_logloss: 0.454176\n",
- "[85]\tvalid_0's auc: 0.817572\tvalid_0's binary_logloss: 0.453609\n",
- "[86]\tvalid_0's auc: 0.817721\tvalid_0's binary_logloss: 0.453213\n",
- "[87]\tvalid_0's auc: 0.817992\tvalid_0's binary_logloss: 0.452586\n",
- "[88]\tvalid_0's auc: 0.81808\tvalid_0's binary_logloss: 0.45204\n",
- "[89]\tvalid_0's auc: 0.818202\tvalid_0's binary_logloss: 0.451643\n",
- "[90]\tvalid_0's auc: 0.818336\tvalid_0's binary_logloss: 0.451081\n",
- "[91]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.450531\n",
- "[92]\tvalid_0's auc: 0.818558\tvalid_0's binary_logloss: 0.450179\n",
- "[93]\tvalid_0's auc: 0.818743\tvalid_0's binary_logloss: 0.449647\n",
- "[94]\tvalid_0's auc: 0.818789\tvalid_0's binary_logloss: 0.449133\n",
- "[95]\tvalid_0's auc: 0.818849\tvalid_0's binary_logloss: 0.44862\n",
- "[96]\tvalid_0's auc: 0.81913\tvalid_0's binary_logloss: 0.448072\n",
- "[97]\tvalid_0's auc: 0.819526\tvalid_0's binary_logloss: 0.447713\n",
- "[98]\tvalid_0's auc: 0.819971\tvalid_0's binary_logloss: 0.447296\n",
- "[99]\tvalid_0's auc: 0.819972\tvalid_0's binary_logloss: 0.446814\n"
- ]
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:22.965433Z",
+ "start_time": "2020-11-18T04:21:17.799127Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Train the ranking model\n",
+ "if offline:\n",
+ " lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,\n",
+ " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
+ " eval_group= [g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
+ "else:\n",
+ " lgb_ranker.fit(trn_user_item_feats_df[lgb_cols], trn_user_item_feats_df['label'], group=g_train)"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
- "[1]\tvalid_0's auc: 0.768646\tvalid_0's binary_logloss: 0.527167\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.779902\tvalid_0's binary_logloss: 0.525481\n",
- "[3]\tvalid_0's auc: 0.789868\tvalid_0's binary_logloss: 0.524485\n",
- "[4]\tvalid_0's auc: 0.791895\tvalid_0's binary_logloss: 0.523382\n",
- "[5]\tvalid_0's auc: 0.795453\tvalid_0's binary_logloss: 0.521759\n",
- "[6]\tvalid_0's auc: 0.796672\tvalid_0's binary_logloss: 0.520166\n",
- "[7]\tvalid_0's auc: 0.798023\tvalid_0's binary_logloss: 0.518857\n",
- "[8]\tvalid_0's auc: 0.799331\tvalid_0's binary_logloss: 0.517297\n",
- "[9]\tvalid_0's auc: 0.800181\tvalid_0's binary_logloss: 0.516416\n",
- "[10]\tvalid_0's auc: 0.800373\tvalid_0's binary_logloss: 0.514967\n",
- "[11]\tvalid_0's auc: 0.801087\tvalid_0's binary_logloss: 0.513631\n",
- "[12]\tvalid_0's auc: 0.801122\tvalid_0's binary_logloss: 0.512658\n",
- "[13]\tvalid_0's auc: 0.801043\tvalid_0's binary_logloss: 0.511833\n",
- "[14]\tvalid_0's auc: 0.801238\tvalid_0's binary_logloss: 0.510461\n",
- "[15]\tvalid_0's auc: 0.801847\tvalid_0's binary_logloss: 0.509034\n",
- "[16]\tvalid_0's auc: 0.803139\tvalid_0's binary_logloss: 0.507759\n",
- "[17]\tvalid_0's auc: 0.803577\tvalid_0's binary_logloss: 0.506361\n",
- "[18]\tvalid_0's auc: 0.803834\tvalid_0's binary_logloss: 0.505229\n",
- "[19]\tvalid_0's auc: 0.803943\tvalid_0's binary_logloss: 0.504371\n",
- "[20]\tvalid_0's auc: 0.80415\tvalid_0's binary_logloss: 0.503102\n",
- "[21]\tvalid_0's auc: 0.804446\tvalid_0's binary_logloss: 0.502564\n",
- "[22]\tvalid_0's auc: 0.805163\tvalid_0's binary_logloss: 0.501396\n",
- "[23]\tvalid_0's auc: 0.805323\tvalid_0's binary_logloss: 0.500327\n",
- "[24]\tvalid_0's auc: 0.805314\tvalid_0's binary_logloss: 0.499123\n",
- "[25]\tvalid_0's auc: 0.80535\tvalid_0's binary_logloss: 0.497927\n",
- "[26]\tvalid_0's auc: 0.805864\tvalid_0's binary_logloss: 0.496834\n",
- "[27]\tvalid_0's auc: 0.805919\tvalid_0's binary_logloss: 0.495667\n",
- "[28]\tvalid_0's auc: 0.806272\tvalid_0's binary_logloss: 0.494606\n",
- "[29]\tvalid_0's auc: 0.806599\tvalid_0's binary_logloss: 0.49343\n",
- "[30]\tvalid_0's auc: 0.806932\tvalid_0's binary_logloss: 0.492303\n",
- "[31]\tvalid_0's auc: 0.806656\tvalid_0's binary_logloss: 0.491249\n",
- "[32]\tvalid_0's auc: 0.807436\tvalid_0's binary_logloss: 0.490188\n",
- "[33]\tvalid_0's auc: 0.807629\tvalid_0's binary_logloss: 0.489117\n",
- "[34]\tvalid_0's auc: 0.807501\tvalid_0's binary_logloss: 0.48808\n",
- "[35]\tvalid_0's auc: 0.807885\tvalid_0's binary_logloss: 0.487383\n",
- "[36]\tvalid_0's auc: 0.807921\tvalid_0's binary_logloss: 0.48636\n",
- "[37]\tvalid_0's auc: 0.808267\tvalid_0's binary_logloss: 0.485724\n",
- "[38]\tvalid_0's auc: 0.808563\tvalid_0's binary_logloss: 0.485076\n",
- "[39]\tvalid_0's auc: 0.808813\tvalid_0's binary_logloss: 0.484039\n",
- "[40]\tvalid_0's auc: 0.809023\tvalid_0's binary_logloss: 0.483091\n",
- "[41]\tvalid_0's auc: 0.809782\tvalid_0's binary_logloss: 0.482441\n",
- "[42]\tvalid_0's auc: 0.810135\tvalid_0's binary_logloss: 0.48179\n",
- "[43]\tvalid_0's auc: 0.810219\tvalid_0's binary_logloss: 0.48082\n",
- "[44]\tvalid_0's auc: 0.81031\tvalid_0's binary_logloss: 0.479906\n",
- "[45]\tvalid_0's auc: 0.810514\tvalid_0's binary_logloss: 0.479024\n",
- "[46]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.478437\n",
- "[47]\tvalid_0's auc: 0.810611\tvalid_0's binary_logloss: 0.477529\n",
- "[48]\tvalid_0's auc: 0.810781\tvalid_0's binary_logloss: 0.476637\n",
- "[49]\tvalid_0's auc: 0.81089\tvalid_0's binary_logloss: 0.475883\n",
- "[50]\tvalid_0's auc: 0.811266\tvalid_0's binary_logloss: 0.475459\n",
- "[51]\tvalid_0's auc: 0.811402\tvalid_0's binary_logloss: 0.475078\n",
- "[52]\tvalid_0's auc: 0.811765\tvalid_0's binary_logloss: 0.474246\n",
- "[53]\tvalid_0's auc: 0.811891\tvalid_0's binary_logloss: 0.473452\n",
- "[54]\tvalid_0's auc: 0.811868\tvalid_0's binary_logloss: 0.47263\n",
- "[55]\tvalid_0's auc: 0.81192\tvalid_0's binary_logloss: 0.471804\n",
- "[56]\tvalid_0's auc: 0.812272\tvalid_0's binary_logloss: 0.471275\n",
- "[57]\tvalid_0's auc: 0.812639\tvalid_0's binary_logloss: 0.470396\n",
- "[58]\tvalid_0's auc: 0.812764\tvalid_0's binary_logloss: 0.469597\n",
- "[59]\tvalid_0's auc: 0.813084\tvalid_0's binary_logloss: 0.469049\n",
- "[60]\tvalid_0's auc: 0.813342\tvalid_0's binary_logloss: 0.468244\n",
- "[61]\tvalid_0's auc: 0.813302\tvalid_0's binary_logloss: 0.467499\n",
- "[62]\tvalid_0's auc: 0.813221\tvalid_0's binary_logloss: 0.466758\n",
- "[63]\tvalid_0's auc: 0.813697\tvalid_0's binary_logloss: 0.466017\n",
- "[64]\tvalid_0's auc: 0.813985\tvalid_0's binary_logloss: 0.465501\n",
- "[65]\tvalid_0's auc: 0.81416\tvalid_0's binary_logloss: 0.464725\n",
- "[66]\tvalid_0's auc: 0.814227\tvalid_0's binary_logloss: 0.46398\n",
- "[67]\tvalid_0's auc: 0.814397\tvalid_0's binary_logloss: 0.463309\n",
- "[68]\tvalid_0's auc: 0.814426\tvalid_0's binary_logloss: 0.462627\n",
- "[69]\tvalid_0's auc: 0.814593\tvalid_0's binary_logloss: 0.462244\n",
- "[70]\tvalid_0's auc: 0.814789\tvalid_0's binary_logloss: 0.461571\n",
- "[71]\tvalid_0's auc: 0.814889\tvalid_0's binary_logloss: 0.461144\n",
- "[72]\tvalid_0's auc: 0.815078\tvalid_0's binary_logloss: 0.460684\n",
- "[73]\tvalid_0's auc: 0.815439\tvalid_0's binary_logloss: 0.460063\n",
- "[74]\tvalid_0's auc: 0.815511\tvalid_0's binary_logloss: 0.459386\n",
- "[75]\tvalid_0's auc: 0.815574\tvalid_0's binary_logloss: 0.45877\n",
- "[76]\tvalid_0's auc: 0.815634\tvalid_0's binary_logloss: 0.458128\n",
- "[77]\tvalid_0's auc: 0.815618\tvalid_0's binary_logloss: 0.457495\n",
- "[78]\tvalid_0's auc: 0.81582\tvalid_0's binary_logloss: 0.457057\n",
- "[79]\tvalid_0's auc: 0.81594\tvalid_0's binary_logloss: 0.456475\n",
- "[80]\tvalid_0's auc: 0.815961\tvalid_0's binary_logloss: 0.455885\n",
- "[81]\tvalid_0's auc: 0.816153\tvalid_0's binary_logloss: 0.455511\n",
- "[82]\tvalid_0's auc: 0.816433\tvalid_0's binary_logloss: 0.455186\n",
- "[83]\tvalid_0's auc: 0.816546\tvalid_0's binary_logloss: 0.454625\n",
- "[84]\tvalid_0's auc: 0.816586\tvalid_0's binary_logloss: 0.454039\n",
- "[85]\tvalid_0's auc: 0.816584\tvalid_0's binary_logloss: 0.453482\n",
- "[86]\tvalid_0's auc: 0.816881\tvalid_0's binary_logloss: 0.453048\n",
- "[87]\tvalid_0's auc: 0.817029\tvalid_0's binary_logloss: 0.452485\n",
- "[88]\tvalid_0's auc: 0.81707\tvalid_0's binary_logloss: 0.451941\n",
- "[89]\tvalid_0's auc: 0.817298\tvalid_0's binary_logloss: 0.451544\n",
- "[90]\tvalid_0's auc: 0.817343\tvalid_0's binary_logloss: 0.450975\n",
- "[91]\tvalid_0's auc: 0.817357\tvalid_0's binary_logloss: 0.450422\n",
- "[92]\tvalid_0's auc: 0.817592\tvalid_0's binary_logloss: 0.450109\n",
- "[93]\tvalid_0's auc: 0.817729\tvalid_0's binary_logloss: 0.449542\n",
- "[94]\tvalid_0's auc: 0.817834\tvalid_0's binary_logloss: 0.448982\n",
- "[95]\tvalid_0's auc: 0.81809\tvalid_0's binary_logloss: 0.448398\n",
- "[96]\tvalid_0's auc: 0.818269\tvalid_0's binary_logloss: 0.447908\n",
- "[97]\tvalid_0's auc: 0.818682\tvalid_0's binary_logloss: 0.447547\n",
- "[98]\tvalid_0's auc: 0.819015\tvalid_0's binary_logloss: 0.447165\n",
- "[99]\tvalid_0's auc: 0.819016\tvalid_0's binary_logloss: 0.446669\n",
- "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n"
- ]
- }
- ],
- "source": [
-    "# Five-fold cross-validation; the folds are split by user\n",
-    "# This part is separate from the standalone training and validation above\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_rank_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
-    "# Five-fold cross-validation; save the intermediate results for stacking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
-    "    # Define the model and its parameters\n",
- " lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) \n",
-    "    # Train the model\n",
- " lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'],eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], \n",
- " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
- " \n",
-    "    # Predict on the validation fold\n",
- " valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols], \n",
- " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
- " \n",
-    "    # No normalization is needed here: the classifier output is already a probability\n",
- " # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
-    "    # Collect the validation-fold predictions in a list and concatenate them later\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
-    "    # For online testing, sum the test predictions of every fold and average them at the end\n",
- " if not offline:\n",
- " sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols], \n",
- " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
-    "# Save the new features produced by cross-validation on the training set\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)\n",
- " \n",
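-    "# Test-set predictions: average over the folds and save the predicted score and the corresponding rank feature for later stacking; more features could also be constructed here\n",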
- "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
-    "# Save the new cross-validation features for the test set\n",
- "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:23.074237Z",
- "start_time": "2020-11-18T04:24:13.812284Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
- "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_cls')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## DIN model"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Users' historical click lists\n",
-    "These lists are prepared for the DIN model used later."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:30.508213Z",
- "start_time": "2020-11-18T04:24:27.426372Z"
- }
- },
- "outputs": [],
- "source": [
- "if offline:\n",
- " all_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
- "else:\n",
- " trn_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
- " tst_data = pd.read_csv('./data_raw/testA_click_log.csv')\n",
- " all_data = trn_data.append(tst_data)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:25:28.082071Z",
- "start_time": "2020-11-18T04:24:33.649524Z"
- }
- },
- "outputs": [],
- "source": [
- "hist_click =all_data[['user_id', 'click_article_id']].groupby('user_id').agg({list}).reset_index()\n",
- "his_behavior_df = pd.DataFrame()\n",
- "his_behavior_df['user_id'] = hist_click['user_id']\n",
- "his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:25:52.925866Z",
- "start_time": "2020-11-18T04:25:52.863922Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_din_model = val_user_item_feats_df.copy()\n",
- "else: \n",
- " val_user_item_feats_df_din_model = None\n",
- " \n",
- "tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:00.070681Z",
- "start_time": "2020-11-18T04:25:56.417197Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
- "else:\n",
- " val_user_item_feats_df_din_model = None\n",
- "\n",
- "tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
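-    "### Introduction to the DIN model\n",
-    "Next we try the DIN model. DIN stands for Deep Interest Network, a model Alibaba proposed in 2018 because earlier deep learning models could not express users' diverse interests. It computes a representation vector of the user's interests from the relevance between the given candidate ad and the user's historical behaviors. Concretely, it introduces a local activation unit that soft-searches the relevant parts of the behavior history to focus on the relevant interests, and uses a weighted sum to obtain the representation of the user's interests with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the interest representation. Because this vector differs from ad to ad, the model's expressive power improves considerably. The model therefore also suits this news recommendation task: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user clicked in the past. The model structure is shown below:\n",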
- "\n",
- "![image-20201116201646983](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)\n",
- "\n",
- "\n",
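-    "Here we use the model directly from a library; its details will be covered in the next round of the recommender-system team study. The following describes how to use it. The deepctr function prototype is:\n",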
- "> def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,\n",
- "> dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation=\"dice\",\n",
- "> att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,\n",
- "> task='binary'):\n",
- "> \n",
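-    "> * dnn_feature_columns: feature columns, a list containing all features of the data\n",
-    "> * history_feature_list: user history behavior columns, a list of the features that reflect the user's historical behavior\n",
-    "> * dnn_use_bn: whether to use BatchNormalization\n",
-    "> * dnn_hidden_units: number of layers of the fully connected network and number of units in each layer, given as a list or tuple\n",
-    "> * dnn_activation: activation type of the fully connected network\n",
-    "> * att_hidden_size: number of layers and units per layer of the fully connected network in the attention layer\n",
-    "> * att_activation: activation type of the attention layer\n",
-    "> * att_weight_normalization: whether to normalize the attention scores\n",
-    "> * l2_reg_dnn: L2 regularization coefficient of the fully connected network\n",
-    "> * l2_reg_embedding: L2 regularization coefficient of the embedding vectors\n",
-    "> * dnn_dropout: dropout probability of the units in the fully connected network\n",
-    "> * task: the task type, either classification or regression\n",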
- "\n",
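-    "In practice we must pass in the feature columns and the history behavior columns, but before doing so the feature columns need some preprocessing, as follows:\n",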
- "\n",
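-    "1. First, we process the dataset to obtain the data. Since we predict whether a user clicks the current article based on the user's past behavior, we split the feature columns into three parts: numerical features, categorical features, and history behavior features. DIN handles each part differently.\n",
-    "    1. Categorical features are the category-type columns in our dataset, such as user_id. They first go through an embedding layer to obtain a low-dimensional dense representation of each feature. Because of the embedding, we need to build a vocabulary over the values of each categorical column and specify the embedding dimension. So when preparing data for deepctr's DIN model, we declare these categorical features with SparseFeat, whose arguments are the column name, the number of unique values in the column (used to build the vocabulary), and the embedding dimension.\n",
-    "    2. User history behavior columns, such as article id and article category, also go through an embedding layer first. The difference is that, after obtaining each feature's embedding, an attention layer computes the relevance between the user's historical behaviors and the current candidate article to produce the current user's embedding vector. This vector reflects the user's interest through the similarity between the candidate article and the articles the user clicked in the past, and it changes with the user's click history, dynamically modeling how the user's interests evolve. For each user this kind of feature is a history behavior sequence, and sequence lengths differ across users: some users clicked many articles, others only a few. We therefore need to unify the length. When preparing data for DIN, we first declare these categorical features with SparseFeat and then use VarLenSparseFeat to declare them as padded sequences so that every user's history sequence has the same length; this is why the function has a maxlen parameter that specifies the maximum sequence length.\n",
-    "    3. For continuous feature columns, we only need DenseFeat to specify the column name and dimension.\n",
-    "2. After processing the feature columns, we match the data to the corresponding columns to obtain the final input data.\n",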
- "\n",
- "下面根据具体的代码感受一下, 逻辑是这样, 首先我们需要写一个数据准备函数, 在这里面就是根据上面的具体步骤准备数据, 得到数据和特征列, 然后就是建立DIN模型并训练, 最后基于模型进行测试。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:08.405211Z",
- "start_time": "2020-11-18T04:26:04.887013Z"
- }
- },
- "outputs": [],
- "source": [
- "# Import deepctr\n",
- "from deepctr.models import DIN\n",
- "from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names\n",
- "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
- "\n",
- "from tensorflow.keras import backend as K\n",
- "from tensorflow.keras.layers import *\n",
- "from tensorflow.keras.models import *\n",
- "from tensorflow.keras.callbacks import * \n",
- "import tensorflow as tf\n",
- "\n",
- "import os\n",
- "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
- "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:13.485712Z",
- "start_time": "2020-11-18T04:26:13.476042Z"
- }
- },
- "outputs": [],
- "source": [
- "# Data preparation function\n",
- "def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, hist_behavior_fea, emb_dim=32, max_len=100):\n",
- "    \"\"\"\n",
- "    Data preparation function:\n",
- "        df: the dataset\n",
- "        dense_fea: numerical feature columns\n",
- "        sparse_fea: categorical feature columns\n",
- "        behavior_fea: the user's candidate behavior feature columns\n",
- "        hist_behavior_fea: the user's history behavior feature columns\n",
- "        emb_dim: embedding dimension; for simplicity all categorical columns share the same dimension\n",
- "        max_len: maximum length of the user sequence\n",
- "    \"\"\"\n",
- "\n",
- "    sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]\n",
- "\n",
- "    dense_feature_columns = [DenseFeat(feat, 1) for feat in dense_fea]\n",
- "\n",
- "    # History sequences reuse the embedding table named 'click_article_id'\n",
- "    var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,\n",
- "                                    embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in hist_behavior_fea]\n",
- "\n",
- "    dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns\n",
- "\n",
- "    # Build x, a dict that maps each feature name to its data\n",
- "    x = {}\n",
- "    for name in get_feature_names(dnn_feature_columns):\n",
- "        if name in hist_behavior_fea:\n",
- "            # History behavior sequence: pad to a 2-D array of shape (n_samples, max_len)\n",
- "            his_list = list(df[name])\n",
- "            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')\n",
- "        else:\n",
- "            x[name] = df[name].values\n",
- "\n",
- "    return x, dnn_feature_columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:18.783217Z",
- "start_time": "2020-11-18T04:26:18.776795Z"
- }
- },
- "outputs": [],
- "source": [
- "# Split the features into groups\n",
- "sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup', \n",
- " 'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']\n",
- "\n",
- "behavior_fea = ['click_article_id']\n",
- "\n",
- "hist_behavior_fea = ['hist_click_article_id']\n",
- "\n",
- "dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',\n",
- " 'rank','click_size','time_diff_mean','active_level','user_time_hob1','user_time_hob2',\n",
- " 'words_hbo','words_count']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:25.469810Z",
- "start_time": "2020-11-18T04:26:24.779347Z"
- }
- },
- "outputs": [],
- "source": [
- "# Normalize the dense features; neural network training generally needs normalized numeric inputs\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "\n",
- "mm = MinMaxScaler()\n",
- "\n",
- "# The two lines below are a special-case fix: if invalid values appear elsewhere, normalization fails unless they are handled.\n",
- "# Leave them commented out at first; if the code below raises an error, first work out how to avoid producing values such as inf.\n",
- "# trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
- "# tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
- "\n",
- "# Note: each split is fit and scaled independently here\n",
- "for feat in dense_fea:\n",
- "    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])\n",
- "\n",
- "    if val_user_item_feats_df_din_model is not None:\n",
- "        val_user_item_feats_df_din_model[feat] = mm.fit_transform(val_user_item_feats_df_din_model[[feat]])\n",
- "\n",
- "    tst_user_item_feats_df_din_model[feat] = mm.fit_transform(tst_user_item_feats_df_din_model[[feat]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:36.727753Z",
- "start_time": "2020-11-18T04:26:28.854705Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:28.616665Z",
+ "start_time": "2020-11-18T04:21:24.672280Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Model prediction\n",
+ "tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ "\n",
+ "# Save a copy of the ranking results here for the later model fusion\n",
+ "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n"
- ]
- }
- ],
- "source": [
- "# Prepare the training data\n",
- "x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea, \n",
- "                                               sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- "y_trn = trn_user_item_feats_df_din_model['label'].values\n",
- "\n",
- "if offline:\n",
- "    # Prepare the validation data\n",
- "    x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea, \n",
- "                                                   sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- "    y_val = val_user_item_feats_df_din_model['label'].values\n",
- "\n",
- "dense_fea = [x for x in dense_fea if x != 'label']\n",
- "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
- "                                               sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:45.146318Z",
- "start_time": "2020-11-18T04:26:40.423914Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:40.253692Z",
+ "start_time": "2020-11-18T04:21:30.546587Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_ranker')"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:255: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
- "Model: \"model\"\n",
- "__________________________________________________________________________________________________\n",
- "Layer (type) Output Shape Param # Connected to \n",
- "==================================================================================================\n",
- "user_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_article_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "category_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_environment (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_deviceGroup (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_os (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_country (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_region (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_referrer_type (InputLayer [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "is_cat_hab (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_user_id (Embedding) (None, 1, 32) 1600032 user_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_seq_emb_hist_click_artic multiple 525664 click_article_id[0][0] \n",
- " hist_click_article_id[0][0] \n",
- " click_article_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_category_id (Embeddi (None, 1, 32) 7776 category_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_environment (E (None, 1, 32) 128 click_environment[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_deviceGroup (E (None, 1, 32) 160 click_deviceGroup[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_os (Embedding) (None, 1, 32) 288 click_os[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 64 is_cat_hab[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0] \n",
- " sparse_seq_emb_hist_click_article\n",
- " sparse_emb_category_id[0][0] \n",
- " sparse_emb_click_environment[0][0\n",
- " sparse_emb_click_deviceGroup[0][0\n",
- " sparse_emb_click_os[0][0] \n",
- " sparse_emb_click_country[0][0] \n",
- " sparse_emb_click_region[0][0] \n",
- " sparse_emb_click_referrer_type[0]\n",
- " sparse_emb_is_cat_hab[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "hist_click_article_id (InputLay [(None, 50)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0] \n",
- " no_mask[1][0] \n",
- " no_mask[2][0] \n",
- " no_mask[3][0] \n",
- " no_mask[4][0] \n",
- " no_mask[5][0] \n",
- " no_mask[6][0] \n",
- " no_mask[7][0] \n",
- " no_mask[8][0] \n",
- " no_mask[9][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article\n",
- " sparse_seq_emb_hist_click_article\n",
- "__________________________________________________________________________________________________\n",
- "concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0] \n",
- " attention_sequence_pooling_layer[\n",
- "__________________________________________________________________________________________________\n",
- "sim0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "time_diff0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "word_diff0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_max (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_min (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_sum (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_mean (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "score (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "rank (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_size (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "time_diff_mean (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "active_level (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "user_time_hob1 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "user_time_hob2 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "words_hbo (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "words_count (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "flatten (Flatten) (None, 352) 0 concatenate_1[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_3 (NoMask) (None, 1) 0 sim0[0][0] \n",
- " time_diff0[0][0] \n",
- " word_diff0[0][0] \n",
- " sim_max[0][0] \n",
- " sim_min[0][0] \n",
- " sim_sum[0][0] \n",
- " sim_mean[0][0] \n",
- " score[0][0] \n",
- " rank[0][0] \n",
- " click_size[0][0] \n",
- " time_diff_mean[0][0] \n",
- " active_level[0][0] \n",
- " user_time_hob1[0][0] \n",
- " user_time_hob2[0][0] \n",
- " words_hbo[0][0] \n",
- " words_count[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_2 (NoMask) (None, 352) 0 flatten[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0] \n",
- " no_mask_3[1][0] \n",
- " no_mask_3[2][0] \n",
- " no_mask_3[3][0] \n",
- " no_mask_3[4][0] \n",
- " no_mask_3[5][0] \n",
- " no_mask_3[6][0] \n",
- " no_mask_3[7][0] \n",
- " no_mask_3[8][0] \n",
- " no_mask_3[9][0] \n",
- " no_mask_3[10][0] \n",
- " no_mask_3[11][0] \n",
- " no_mask_3[12][0] \n",
- " no_mask_3[13][0] \n",
- " no_mask_3[14][0] \n",
- " no_mask_3[15][0] \n",
- "__________________________________________________________________________________________________\n",
- "flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_4 (NoMask) multiple 0 flatten_1[0][0] \n",
- " flatten_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0] \n",
- " no_mask_4[1][0] \n",
- "__________________________________________________________________________________________________\n",
- "dnn_1 (DNN) (None, 80) 89880 concatenate_3[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "dense (Dense) (None, 1) 80 dnn_1[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "prediction_layer (PredictionLay (None, 1) 1 dense[0][0] \n",
- "==================================================================================================\n",
- "Total params: 2,239,602\n",
- "Trainable params: 2,239,362\n",
- "Non-trainable params: 240\n",
- "__________________________________________________________________________________________________\n"
- ]
- }
- ],
- "source": [
- "# Build the model\n",
- "model = DIN(dnn_feature_columns, behavior_fea)\n",
- "\n",
- "# Inspect the model structure\n",
- "model.summary()\n",
- "\n",
- "# Compile the model\n",
- "model.compile('adam', 'binary_crossentropy', metrics=['binary_crossentropy', tf.keras.metrics.AUC()])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:28:43.885773Z",
- "start_time": "2020-11-18T04:26:48.746787Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:26.195838Z",
+ "start_time": "2020-11-18T04:21:46.115002Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[1]\tvalid_0's ndcg@1: 0.909975\tvalid_0's ndcg@2: 0.963068\tvalid_0's ndcg@3: 0.96533\tvalid_0's ndcg@4: 0.965729\tvalid_0's ndcg@5: 0.965864\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9143\tvalid_0's ndcg@2: 0.964711\tvalid_0's ndcg@3: 0.966961\tvalid_0's ndcg@4: 0.967338\tvalid_0's ndcg@5: 0.967483\n",
+ "[3]\tvalid_0's ndcg@1: 0.9181\tvalid_0's ndcg@2: 0.966114\tvalid_0's ndcg@3: 0.968289\tvalid_0's ndcg@4: 0.968773\tvalid_0's ndcg@5: 0.96887\n",
+ "[4]\tvalid_0's ndcg@1: 0.925575\tvalid_0's ndcg@2: 0.969093\tvalid_0's ndcg@3: 0.971193\tvalid_0's ndcg@4: 0.971603\tvalid_0's ndcg@5: 0.97169\n",
+ "[5]\tvalid_0's ndcg@1: 0.9267\tvalid_0's ndcg@2: 0.969635\tvalid_0's ndcg@3: 0.97166\tvalid_0's ndcg@4: 0.972037\tvalid_0's ndcg@5: 0.972133\n",
+ "[6]\tvalid_0's ndcg@1: 0.927\tvalid_0's ndcg@2: 0.969682\tvalid_0's ndcg@3: 0.971757\tvalid_0's ndcg@4: 0.972134\tvalid_0's ndcg@5: 0.972231\n",
+ "[7]\tvalid_0's ndcg@1: 0.928825\tvalid_0's ndcg@2: 0.970451\tvalid_0's ndcg@3: 0.972476\tvalid_0's ndcg@4: 0.97282\tvalid_0's ndcg@5: 0.972927\n",
+ "[8]\tvalid_0's ndcg@1: 0.930025\tvalid_0's ndcg@2: 0.970988\tvalid_0's ndcg@3: 0.972951\tvalid_0's ndcg@4: 0.973295\tvalid_0's ndcg@5: 0.973402\n",
+ "[9]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971347\tvalid_0's ndcg@3: 0.973384\tvalid_0's ndcg@4: 0.973707\tvalid_0's ndcg@5: 0.973794\n",
+ "[10]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.971385\tvalid_0's ndcg@3: 0.973372\tvalid_0's ndcg@4: 0.973717\tvalid_0's ndcg@5: 0.973794\n",
+ "[11]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.971433\tvalid_0's ndcg@3: 0.973333\tvalid_0's ndcg@4: 0.973699\tvalid_0's ndcg@5: 0.973767\n",
+ "[12]\tvalid_0's ndcg@1: 0.93145\tvalid_0's ndcg@2: 0.971656\tvalid_0's ndcg@3: 0.973493\tvalid_0's ndcg@4: 0.973881\tvalid_0's ndcg@5: 0.973949\n",
+ "[13]\tvalid_0's ndcg@1: 0.932525\tvalid_0's ndcg@2: 0.971927\tvalid_0's ndcg@3: 0.973839\tvalid_0's ndcg@4: 0.974227\tvalid_0's ndcg@5: 0.974304\n",
+ "[14]\tvalid_0's ndcg@1: 0.932575\tvalid_0's ndcg@2: 0.971898\tvalid_0's ndcg@3: 0.973823\tvalid_0's ndcg@4: 0.974243\tvalid_0's ndcg@5: 0.97432\n",
+ "[15]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972239\tvalid_0's ndcg@3: 0.974189\tvalid_0's ndcg@4: 0.974587\tvalid_0's ndcg@5: 0.974665\n",
+ "[16]\tvalid_0's ndcg@1: 0.933475\tvalid_0's ndcg@2: 0.972309\tvalid_0's ndcg@3: 0.974209\tvalid_0's ndcg@4: 0.974596\tvalid_0's ndcg@5: 0.974674\n",
+ "[17]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972369\tvalid_0's ndcg@3: 0.974307\tvalid_0's ndcg@4: 0.974684\tvalid_0's ndcg@5: 0.974761\n",
+ "[18]\tvalid_0's ndcg@1: 0.9339\tvalid_0's ndcg@2: 0.972497\tvalid_0's ndcg@3: 0.974372\tvalid_0's ndcg@4: 0.974749\tvalid_0's ndcg@5: 0.974836\n",
+ "[19]\tvalid_0's ndcg@1: 0.9345\tvalid_0's ndcg@2: 0.972845\tvalid_0's ndcg@3: 0.974645\tvalid_0's ndcg@4: 0.974979\tvalid_0's ndcg@5: 0.975085\n",
+ "[20]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.97484\tvalid_0's ndcg@4: 0.975174\tvalid_0's ndcg@5: 0.975271\n",
+ "[21]\tvalid_0's ndcg@1: 0.935\tvalid_0's ndcg@2: 0.973092\tvalid_0's ndcg@3: 0.97488\tvalid_0's ndcg@4: 0.975192\tvalid_0's ndcg@5: 0.975289\n",
+ "[22]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.974988\tvalid_0's ndcg@4: 0.975289\tvalid_0's ndcg@5: 0.975386\n",
+ "[23]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974824\tvalid_0's ndcg@4: 0.975136\tvalid_0's ndcg@5: 0.975223\n",
+ "[24]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973274\tvalid_0's ndcg@3: 0.975087\tvalid_0's ndcg@4: 0.975388\tvalid_0's ndcg@5: 0.975475\n",
+ "[25]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973345\tvalid_0's ndcg@3: 0.97512\tvalid_0's ndcg@4: 0.975443\tvalid_0's ndcg@5: 0.97553\n",
+ "[26]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.975\tvalid_0's ndcg@4: 0.975313\tvalid_0's ndcg@5: 0.9754\n",
+ "[27]\tvalid_0's ndcg@1: 0.935175\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.974983\tvalid_0's ndcg@4: 0.975295\tvalid_0's ndcg@5: 0.975382\n",
+ "[28]\tvalid_0's ndcg@1: 0.935425\tvalid_0's ndcg@2: 0.973328\tvalid_0's ndcg@3: 0.975041\tvalid_0's ndcg@4: 0.975374\tvalid_0's ndcg@5: 0.975471\n",
+ "[29]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973225\tvalid_0's ndcg@3: 0.974963\tvalid_0's ndcg@4: 0.975297\tvalid_0's ndcg@5: 0.975403\n",
+ "[30]\tvalid_0's ndcg@1: 0.9353\tvalid_0's ndcg@2: 0.973235\tvalid_0's ndcg@3: 0.97501\tvalid_0's ndcg@4: 0.975311\tvalid_0's ndcg@5: 0.975418\n",
+ "[31]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973361\tvalid_0's ndcg@3: 0.975099\tvalid_0's ndcg@4: 0.975422\tvalid_0's ndcg@5: 0.975528\n",
+ "[32]\tvalid_0's ndcg@1: 0.9364\tvalid_0's ndcg@2: 0.973641\tvalid_0's ndcg@3: 0.975391\tvalid_0's ndcg@4: 0.975714\tvalid_0's ndcg@5: 0.97582\n",
+ "[33]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973751\tvalid_0's ndcg@3: 0.975501\tvalid_0's ndcg@4: 0.975824\tvalid_0's ndcg@5: 0.975931\n",
+ "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.973902\tvalid_0's ndcg@3: 0.975677\tvalid_0's ndcg@4: 0.975989\tvalid_0's ndcg@5: 0.976095\n",
+ "[35]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974105\tvalid_0's ndcg@3: 0.975892\tvalid_0's ndcg@4: 0.976194\tvalid_0's ndcg@5: 0.9763\n",
+ "[36]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974184\tvalid_0's ndcg@3: 0.975984\tvalid_0's ndcg@4: 0.976296\tvalid_0's ndcg@5: 0.976402\n",
+ "[37]\tvalid_0's ndcg@1: 0.93845\tvalid_0's ndcg@2: 0.974366\tvalid_0's ndcg@3: 0.976166\tvalid_0's ndcg@4: 0.976467\tvalid_0's ndcg@5: 0.976574\n",
+ "[38]\tvalid_0's ndcg@1: 0.938925\tvalid_0's ndcg@2: 0.974557\tvalid_0's ndcg@3: 0.976332\tvalid_0's ndcg@4: 0.976655\tvalid_0's ndcg@5: 0.976751\n",
+ "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974471\tvalid_0's ndcg@3: 0.976234\tvalid_0's ndcg@4: 0.976557\tvalid_0's ndcg@5: 0.976653\n",
+ "[40]\tvalid_0's ndcg@1: 0.938325\tvalid_0's ndcg@2: 0.974335\tvalid_0's ndcg@3: 0.97611\tvalid_0's ndcg@4: 0.976433\tvalid_0's ndcg@5: 0.97653\n",
+ "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.974669\tvalid_0's ndcg@3: 0.976431\tvalid_0's ndcg@4: 0.976743\tvalid_0's ndcg@5: 0.97683\n",
+ "[42]\tvalid_0's ndcg@1: 0.939375\tvalid_0's ndcg@2: 0.974833\tvalid_0's ndcg@3: 0.976546\tvalid_0's ndcg@4: 0.976858\tvalid_0's ndcg@5: 0.976945\n",
+ "[43]\tvalid_0's ndcg@1: 0.939625\tvalid_0's ndcg@2: 0.974878\tvalid_0's ndcg@3: 0.976628\tvalid_0's ndcg@4: 0.97694\tvalid_0's ndcg@5: 0.977027\n",
+ "[44]\tvalid_0's ndcg@1: 0.9395\tvalid_0's ndcg@2: 0.974832\tvalid_0's ndcg@3: 0.97657\tvalid_0's ndcg@4: 0.976893\tvalid_0's ndcg@5: 0.97698\n",
+ "[45]\tvalid_0's ndcg@1: 0.939775\tvalid_0's ndcg@2: 0.974949\tvalid_0's ndcg@3: 0.976674\tvalid_0's ndcg@4: 0.976997\tvalid_0's ndcg@5: 0.977084\n",
+ "[46]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.974945\tvalid_0's ndcg@3: 0.976708\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977107\n",
+ "[47]\tvalid_0's ndcg@1: 0.94005\tvalid_0's ndcg@2: 0.975004\tvalid_0's ndcg@3: 0.976766\tvalid_0's ndcg@4: 0.977078\tvalid_0's ndcg@5: 0.977175\n",
+ "[48]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
+ "[49]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
+ "[50]\tvalid_0's ndcg@1: 0.9405\tvalid_0's ndcg@2: 0.975264\tvalid_0's ndcg@3: 0.976989\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
+ "[51]\tvalid_0's ndcg@1: 0.941125\tvalid_0's ndcg@2: 0.975526\tvalid_0's ndcg@3: 0.977226\tvalid_0's ndcg@4: 0.977528\tvalid_0's ndcg@5: 0.977605\n",
+ "[52]\tvalid_0's ndcg@1: 0.941\tvalid_0's ndcg@2: 0.97548\tvalid_0's ndcg@3: 0.977193\tvalid_0's ndcg@4: 0.977484\tvalid_0's ndcg@5: 0.977561\n",
+ "[53]\tvalid_0's ndcg@1: 0.9411\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.977259\tvalid_0's ndcg@4: 0.977539\tvalid_0's ndcg@5: 0.977616\n",
+ "[54]\tvalid_0's ndcg@1: 0.9412\tvalid_0's ndcg@2: 0.975712\tvalid_0's ndcg@3: 0.977299\tvalid_0's ndcg@4: 0.97759\tvalid_0's ndcg@5: 0.977667\n",
+ "[55]\tvalid_0's ndcg@1: 0.94155\tvalid_0's ndcg@2: 0.975841\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977719\tvalid_0's ndcg@5: 0.977797\n",
+ "[56]\tvalid_0's ndcg@1: 0.941825\tvalid_0's ndcg@2: 0.975943\tvalid_0's ndcg@3: 0.97753\tvalid_0's ndcg@4: 0.977821\tvalid_0's ndcg@5: 0.977898\n",
+ "[57]\tvalid_0's ndcg@1: 0.9416\tvalid_0's ndcg@2: 0.975891\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977741\tvalid_0's ndcg@5: 0.977818\n",
+ "[58]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977494\tvalid_0's ndcg@4: 0.977795\tvalid_0's ndcg@5: 0.977873\n",
+ "[59]\tvalid_0's ndcg@1: 0.942025\tvalid_0's ndcg@2: 0.975985\tvalid_0's ndcg@3: 0.977547\tvalid_0's ndcg@4: 0.977881\tvalid_0's ndcg@5: 0.977958\n",
+ "[60]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975994\tvalid_0's ndcg@3: 0.977569\tvalid_0's ndcg@4: 0.977892\tvalid_0's ndcg@5: 0.977969\n",
+ "[61]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977559\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.97796\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[62]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976027\tvalid_0's ndcg@3: 0.97764\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.978028\n",
+ "[63]\tvalid_0's ndcg@1: 0.942125\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977622\tvalid_0's ndcg@4: 0.977912\tvalid_0's ndcg@5: 0.977999\n",
+ "[64]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977793\tvalid_0's ndcg@4: 0.978105\tvalid_0's ndcg@5: 0.978192\n",
+ "[65]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976227\tvalid_0's ndcg@3: 0.977802\tvalid_0's ndcg@4: 0.978125\tvalid_0's ndcg@5: 0.978212\n",
+ "[66]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976132\tvalid_0's ndcg@3: 0.977695\tvalid_0's ndcg@4: 0.978018\tvalid_0's ndcg@5: 0.978105\n",
+ "[67]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976092\tvalid_0's ndcg@3: 0.977679\tvalid_0's ndcg@4: 0.978002\tvalid_0's ndcg@5: 0.978089\n",
+ "[68]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976148\tvalid_0's ndcg@3: 0.977698\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.978108\n",
+ "[69]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976123\tvalid_0's ndcg@3: 0.977686\tvalid_0's ndcg@4: 0.978009\tvalid_0's ndcg@5: 0.978096\n",
+ "[70]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976222\tvalid_0's ndcg@3: 0.977785\tvalid_0's ndcg@4: 0.978097\tvalid_0's ndcg@5: 0.978184\n",
+ "[71]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976188\tvalid_0's ndcg@3: 0.977763\tvalid_0's ndcg@4: 0.978075\tvalid_0's ndcg@5: 0.978162\n",
+ "[72]\tvalid_0's ndcg@1: 0.9427\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977809\tvalid_0's ndcg@4: 0.978121\tvalid_0's ndcg@5: 0.978208\n",
+ "[73]\tvalid_0's ndcg@1: 0.9428\tvalid_0's ndcg@2: 0.976255\tvalid_0's ndcg@3: 0.977843\tvalid_0's ndcg@4: 0.978155\tvalid_0's ndcg@5: 0.978242\n",
+ "[74]\tvalid_0's ndcg@1: 0.94295\tvalid_0's ndcg@2: 0.97631\tvalid_0's ndcg@3: 0.977898\tvalid_0's ndcg@4: 0.97821\tvalid_0's ndcg@5: 0.978297\n",
+ "[75]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976329\tvalid_0's ndcg@3: 0.977941\tvalid_0's ndcg@4: 0.978232\tvalid_0's ndcg@5: 0.978319\n",
+ "[76]\tvalid_0's ndcg@1: 0.9433\tvalid_0's ndcg@2: 0.976471\tvalid_0's ndcg@3: 0.978059\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978437\n",
+ "[77]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976416\tvalid_0's ndcg@3: 0.977991\tvalid_0's ndcg@4: 0.978314\tvalid_0's ndcg@5: 0.978381\n",
+ "[78]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976657\tvalid_0's ndcg@3: 0.978194\tvalid_0's ndcg@4: 0.978517\tvalid_0's ndcg@5: 0.978585\n",
+ "[79]\tvalid_0's ndcg@1: 0.94365\tvalid_0's ndcg@2: 0.976663\tvalid_0's ndcg@3: 0.978188\tvalid_0's ndcg@4: 0.978501\tvalid_0's ndcg@5: 0.978578\n",
+ "[80]\tvalid_0's ndcg@1: 0.943725\tvalid_0's ndcg@2: 0.976628\tvalid_0's ndcg@3: 0.978203\tvalid_0's ndcg@4: 0.978515\tvalid_0's ndcg@5: 0.978593\n",
+ "[81]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97672\tvalid_0's ndcg@3: 0.978295\tvalid_0's ndcg@4: 0.978607\tvalid_0's ndcg@5: 0.978685\n",
+ "[82]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978397\tvalid_0's ndcg@4: 0.97872\tvalid_0's ndcg@5: 0.978787\n",
+ "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976788\tvalid_0's ndcg@3: 0.978375\tvalid_0's ndcg@4: 0.978698\tvalid_0's ndcg@5: 0.978766\n",
+ "[84]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.97679\tvalid_0's ndcg@3: 0.97839\tvalid_0's ndcg@4: 0.978702\tvalid_0's ndcg@5: 0.97878\n",
+ "[85]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.976809\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978723\tvalid_0's ndcg@5: 0.9788\n",
+ "[86]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976939\tvalid_0's ndcg@3: 0.978502\tvalid_0's ndcg@4: 0.978814\tvalid_0's ndcg@5: 0.978891\n",
+ "[87]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.976976\tvalid_0's ndcg@3: 0.978551\tvalid_0's ndcg@4: 0.978852\tvalid_0's ndcg@5: 0.97893\n",
+ "[88]\tvalid_0's ndcg@1: 0.944925\tvalid_0's ndcg@2: 0.977102\tvalid_0's ndcg@3: 0.978677\tvalid_0's ndcg@4: 0.978968\tvalid_0's ndcg@5: 0.979045\n",
+ "[89]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978758\tvalid_0's ndcg@4: 0.979048\tvalid_0's ndcg@5: 0.979126\n",
+ "[90]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.978735\tvalid_0's ndcg@4: 0.979026\tvalid_0's ndcg@5: 0.979104\n",
+ "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978858\tvalid_0's ndcg@4: 0.979138\tvalid_0's ndcg@5: 0.979215\n",
+ "[92]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.977267\tvalid_0's ndcg@3: 0.978905\tvalid_0's ndcg@4: 0.979174\tvalid_0's ndcg@5: 0.979251\n",
+ "[93]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977193\tvalid_0's ndcg@3: 0.978818\tvalid_0's ndcg@4: 0.979098\tvalid_0's ndcg@5: 0.979176\n",
+ "[94]\tvalid_0's ndcg@1: 0.94545\tvalid_0's ndcg@2: 0.97728\tvalid_0's ndcg@3: 0.97888\tvalid_0's ndcg@4: 0.97916\tvalid_0's ndcg@5: 0.979238\n",
+ "[95]\tvalid_0's ndcg@1: 0.9458\tvalid_0's ndcg@2: 0.977394\tvalid_0's ndcg@3: 0.979006\tvalid_0's ndcg@4: 0.979286\tvalid_0's ndcg@5: 0.979364\n",
+ "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979114\tvalid_0's ndcg@4: 0.979394\tvalid_0's ndcg@5: 0.979472\n",
+ "[97]\tvalid_0's ndcg@1: 0.946475\tvalid_0's ndcg@2: 0.977659\tvalid_0's ndcg@3: 0.979259\tvalid_0's ndcg@4: 0.979539\tvalid_0's ndcg@5: 0.979616\n",
+ "[98]\tvalid_0's ndcg@1: 0.94675\tvalid_0's ndcg@2: 0.97776\tvalid_0's ndcg@3: 0.97936\tvalid_0's ndcg@4: 0.979651\tvalid_0's ndcg@5: 0.979719\n",
+ "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
+ "[100]\tvalid_0's ndcg@1: 0.9468\tvalid_0's ndcg@2: 0.977794\tvalid_0's ndcg@3: 0.979369\tvalid_0's ndcg@4: 0.979671\tvalid_0's ndcg@5: 0.979739\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
+ "[1]\tvalid_0's ndcg@1: 0.909075\tvalid_0's ndcg@2: 0.963019\tvalid_0's ndcg@3: 0.965069\tvalid_0's ndcg@4: 0.965543\tvalid_0's ndcg@5: 0.965601\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9123\tvalid_0's ndcg@2: 0.964273\tvalid_0's ndcg@3: 0.966248\tvalid_0's ndcg@4: 0.966722\tvalid_0's ndcg@5: 0.966789\n",
+ "[3]\tvalid_0's ndcg@1: 0.915075\tvalid_0's ndcg@2: 0.965691\tvalid_0's ndcg@3: 0.967466\tvalid_0's ndcg@4: 0.967854\tvalid_0's ndcg@5: 0.967922\n",
+ "[4]\tvalid_0's ndcg@1: 0.91845\tvalid_0's ndcg@2: 0.967047\tvalid_0's ndcg@3: 0.968735\tvalid_0's ndcg@4: 0.969133\tvalid_0's ndcg@5: 0.969201\n",
+ "[5]\tvalid_0's ndcg@1: 0.92355\tvalid_0's ndcg@2: 0.968961\tvalid_0's ndcg@3: 0.970674\tvalid_0's ndcg@4: 0.97104\tvalid_0's ndcg@5: 0.971098\n",
+ "[6]\tvalid_0's ndcg@1: 0.9253\tvalid_0's ndcg@2: 0.969607\tvalid_0's ndcg@3: 0.971345\tvalid_0's ndcg@4: 0.971689\tvalid_0's ndcg@5: 0.971747\n",
+ "[7]\tvalid_0's ndcg@1: 0.926225\tvalid_0's ndcg@2: 0.969933\tvalid_0's ndcg@3: 0.971708\tvalid_0's ndcg@4: 0.972031\tvalid_0's ndcg@5: 0.972079\n",
+ "[8]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.970104\tvalid_0's ndcg@3: 0.971804\tvalid_0's ndcg@4: 0.972116\tvalid_0's ndcg@5: 0.972184\n",
+ "[9]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970682\tvalid_0's ndcg@3: 0.972307\tvalid_0's ndcg@4: 0.972598\tvalid_0's ndcg@5: 0.972675\n",
+ "[10]\tvalid_0's ndcg@1: 0.92775\tvalid_0's ndcg@2: 0.970653\tvalid_0's ndcg@3: 0.972316\tvalid_0's ndcg@4: 0.972617\tvalid_0's ndcg@5: 0.972685\n",
+ "[11]\tvalid_0's ndcg@1: 0.9283\tvalid_0's ndcg@2: 0.97084\tvalid_0's ndcg@3: 0.97254\tvalid_0's ndcg@4: 0.97281\tvalid_0's ndcg@5: 0.972887\n",
+ "[12]\tvalid_0's ndcg@1: 0.9287\tvalid_0's ndcg@2: 0.971051\tvalid_0's ndcg@3: 0.972701\tvalid_0's ndcg@4: 0.97297\tvalid_0's ndcg@5: 0.973048\n",
+ "[13]\tvalid_0's ndcg@1: 0.9297\tvalid_0's ndcg@2: 0.971389\tvalid_0's ndcg@3: 0.973001\tvalid_0's ndcg@4: 0.973313\tvalid_0's ndcg@5: 0.9734\n",
+ "[14]\tvalid_0's ndcg@1: 0.92955\tvalid_0's ndcg@2: 0.971444\tvalid_0's ndcg@3: 0.972994\tvalid_0's ndcg@4: 0.973284\tvalid_0's ndcg@5: 0.973371\n",
+ "[15]\tvalid_0's ndcg@1: 0.930225\tvalid_0's ndcg@2: 0.97174\tvalid_0's ndcg@3: 0.973253\tvalid_0's ndcg@4: 0.973543\tvalid_0's ndcg@5: 0.97363\n",
+ "[16]\tvalid_0's ndcg@1: 0.930425\tvalid_0's ndcg@2: 0.971798\tvalid_0's ndcg@3: 0.973298\tvalid_0's ndcg@4: 0.97361\tvalid_0's ndcg@5: 0.973698\n",
+ "[17]\tvalid_0's ndcg@1: 0.93125\tvalid_0's ndcg@2: 0.971992\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973903\tvalid_0's ndcg@5: 0.97398\n",
+ "[18]\tvalid_0's ndcg@1: 0.931925\tvalid_0's ndcg@2: 0.972257\tvalid_0's ndcg@3: 0.973845\tvalid_0's ndcg@4: 0.974146\tvalid_0's ndcg@5: 0.974224\n",
+ "[19]\tvalid_0's ndcg@1: 0.932375\tvalid_0's ndcg@2: 0.972376\tvalid_0's ndcg@3: 0.974038\tvalid_0's ndcg@4: 0.974318\tvalid_0's ndcg@5: 0.974376\n",
+ "[20]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.972269\tvalid_0's ndcg@3: 0.973907\tvalid_0's ndcg@4: 0.974187\tvalid_0's ndcg@5: 0.974245\n",
+ "[21]\tvalid_0's ndcg@1: 0.932725\tvalid_0's ndcg@2: 0.972568\tvalid_0's ndcg@3: 0.974181\tvalid_0's ndcg@4: 0.974471\tvalid_0's ndcg@5: 0.974529\n",
+ "[22]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972735\tvalid_0's ndcg@3: 0.974298\tvalid_0's ndcg@4: 0.974599\tvalid_0's ndcg@5: 0.974657\n",
+ "[23]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972642\tvalid_0's ndcg@3: 0.974255\tvalid_0's ndcg@4: 0.974545\tvalid_0's ndcg@5: 0.974594\n",
+ "[24]\tvalid_0's ndcg@1: 0.933175\tvalid_0's ndcg@2: 0.972734\tvalid_0's ndcg@3: 0.974347\tvalid_0's ndcg@4: 0.974638\tvalid_0's ndcg@5: 0.974686\n",
+ "[25]\tvalid_0's ndcg@1: 0.9331\tvalid_0's ndcg@2: 0.972754\tvalid_0's ndcg@3: 0.974366\tvalid_0's ndcg@4: 0.974636\tvalid_0's ndcg@5: 0.974674\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[26]\tvalid_0's ndcg@1: 0.933275\tvalid_0's ndcg@2: 0.972787\tvalid_0's ndcg@3: 0.974424\tvalid_0's ndcg@4: 0.974694\tvalid_0's ndcg@5: 0.974732\n",
+ "[27]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972809\tvalid_0's ndcg@3: 0.974434\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.974732\n",
+ "[28]\tvalid_0's ndcg@1: 0.933625\tvalid_0's ndcg@2: 0.972932\tvalid_0's ndcg@3: 0.974557\tvalid_0's ndcg@4: 0.974826\tvalid_0's ndcg@5: 0.974855\n",
+ "[29]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972937\tvalid_0's ndcg@3: 0.974587\tvalid_0's ndcg@4: 0.974856\tvalid_0's ndcg@5: 0.974885\n",
+ "[30]\tvalid_0's ndcg@1: 0.93355\tvalid_0's ndcg@2: 0.972873\tvalid_0's ndcg@3: 0.974523\tvalid_0's ndcg@4: 0.974792\tvalid_0's ndcg@5: 0.974821\n",
+ "[31]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973065\tvalid_0's ndcg@3: 0.974753\tvalid_0's ndcg@4: 0.975022\tvalid_0's ndcg@5: 0.975051\n",
+ "[32]\tvalid_0's ndcg@1: 0.93435\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974815\tvalid_0's ndcg@4: 0.975084\tvalid_0's ndcg@5: 0.975113\n",
+ "[33]\tvalid_0's ndcg@1: 0.934475\tvalid_0's ndcg@2: 0.97323\tvalid_0's ndcg@3: 0.974855\tvalid_0's ndcg@4: 0.975135\tvalid_0's ndcg@5: 0.975164\n",
+ "[34]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973113\tvalid_0's ndcg@3: 0.974738\tvalid_0's ndcg@4: 0.975028\tvalid_0's ndcg@5: 0.975057\n",
+ "[35]\tvalid_0's ndcg@1: 0.93455\tvalid_0's ndcg@2: 0.973258\tvalid_0's ndcg@3: 0.97487\tvalid_0's ndcg@4: 0.975172\tvalid_0's ndcg@5: 0.975201\n",
+ "[36]\tvalid_0's ndcg@1: 0.9344\tvalid_0's ndcg@2: 0.973265\tvalid_0's ndcg@3: 0.974828\tvalid_0's ndcg@4: 0.975129\tvalid_0's ndcg@5: 0.975158\n",
+ "[37]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973438\tvalid_0's ndcg@3: 0.975013\tvalid_0's ndcg@4: 0.975304\tvalid_0's ndcg@5: 0.975323\n",
+ "[38]\tvalid_0's ndcg@1: 0.934975\tvalid_0's ndcg@2: 0.973541\tvalid_0's ndcg@3: 0.975066\tvalid_0's ndcg@4: 0.975367\tvalid_0's ndcg@5: 0.975386\n",
+ "[39]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973667\tvalid_0's ndcg@3: 0.975192\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975502\n",
+ "[40]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973624\tvalid_0's ndcg@3: 0.975174\tvalid_0's ndcg@4: 0.975454\tvalid_0's ndcg@5: 0.975473\n",
+ "[41]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973686\tvalid_0's ndcg@3: 0.975223\tvalid_0's ndcg@4: 0.975503\tvalid_0's ndcg@5: 0.975522\n",
+ "[42]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973716\tvalid_0's ndcg@3: 0.975266\tvalid_0's ndcg@4: 0.975546\tvalid_0's ndcg@5: 0.975565\n",
+ "[43]\tvalid_0's ndcg@1: 0.93615\tvalid_0's ndcg@2: 0.974022\tvalid_0's ndcg@3: 0.975534\tvalid_0's ndcg@4: 0.975814\tvalid_0's ndcg@5: 0.975843\n",
+ "[44]\tvalid_0's ndcg@1: 0.936225\tvalid_0's ndcg@2: 0.974112\tvalid_0's ndcg@3: 0.975562\tvalid_0's ndcg@4: 0.975853\tvalid_0's ndcg@5: 0.975882\n",
+ "[45]\tvalid_0's ndcg@1: 0.9365\tvalid_0's ndcg@2: 0.974167\tvalid_0's ndcg@3: 0.975654\tvalid_0's ndcg@4: 0.975945\tvalid_0's ndcg@5: 0.975974\n",
+ "[46]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974206\tvalid_0's ndcg@3: 0.975694\tvalid_0's ndcg@4: 0.975995\tvalid_0's ndcg@5: 0.976024\n",
+ "[47]\tvalid_0's ndcg@1: 0.93685\tvalid_0's ndcg@2: 0.974311\tvalid_0's ndcg@3: 0.975786\tvalid_0's ndcg@4: 0.976077\tvalid_0's ndcg@5: 0.976106\n",
+ "[48]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974408\tvalid_0's ndcg@3: 0.975845\tvalid_0's ndcg@4: 0.976147\tvalid_0's ndcg@5: 0.976185\n",
+ "[49]\tvalid_0's ndcg@1: 0.936975\tvalid_0's ndcg@2: 0.974342\tvalid_0's ndcg@3: 0.975829\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.976159\n",
+ "[50]\tvalid_0's ndcg@1: 0.9371\tvalid_0's ndcg@2: 0.974388\tvalid_0's ndcg@3: 0.97585\tvalid_0's ndcg@4: 0.976152\tvalid_0's ndcg@5: 0.976191\n",
+ "[51]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974329\tvalid_0's ndcg@3: 0.975841\tvalid_0's ndcg@4: 0.976121\tvalid_0's ndcg@5: 0.97616\n",
+ "[52]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974578\tvalid_0's ndcg@3: 0.976078\tvalid_0's ndcg@4: 0.976369\tvalid_0's ndcg@5: 0.976407\n",
+ "[53]\tvalid_0's ndcg@1: 0.9378\tvalid_0's ndcg@2: 0.974615\tvalid_0's ndcg@3: 0.976115\tvalid_0's ndcg@4: 0.976405\tvalid_0's ndcg@5: 0.976444\n",
+ "[54]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974689\tvalid_0's ndcg@3: 0.976214\tvalid_0's ndcg@4: 0.976483\tvalid_0's ndcg@5: 0.976521\n",
+ "[55]\tvalid_0's ndcg@1: 0.938225\tvalid_0's ndcg@2: 0.974803\tvalid_0's ndcg@3: 0.976303\tvalid_0's ndcg@4: 0.976572\tvalid_0's ndcg@5: 0.976611\n",
+ "[56]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.9748\tvalid_0's ndcg@3: 0.976275\tvalid_0's ndcg@4: 0.976555\tvalid_0's ndcg@5: 0.976594\n",
+ "[57]\tvalid_0's ndcg@1: 0.938525\tvalid_0's ndcg@2: 0.974914\tvalid_0's ndcg@3: 0.976414\tvalid_0's ndcg@4: 0.976683\tvalid_0's ndcg@5: 0.976722\n",
+ "[58]\tvalid_0's ndcg@1: 0.93875\tvalid_0's ndcg@2: 0.975028\tvalid_0's ndcg@3: 0.976503\tvalid_0's ndcg@4: 0.976773\tvalid_0's ndcg@5: 0.976811\n",
+ "[59]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975198\tvalid_0's ndcg@3: 0.976648\tvalid_0's ndcg@4: 0.976918\tvalid_0's ndcg@5: 0.976956\n",
+ "[60]\tvalid_0's ndcg@1: 0.939025\tvalid_0's ndcg@2: 0.975177\tvalid_0's ndcg@3: 0.976615\tvalid_0's ndcg@4: 0.976884\tvalid_0's ndcg@5: 0.976923\n",
+ "[61]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975205\tvalid_0's ndcg@3: 0.976642\tvalid_0's ndcg@4: 0.976912\tvalid_0's ndcg@5: 0.97695\n",
+ "[62]\tvalid_0's ndcg@1: 0.93965\tvalid_0's ndcg@2: 0.975424\tvalid_0's ndcg@3: 0.976836\tvalid_0's ndcg@4: 0.977116\tvalid_0's ndcg@5: 0.977155\n",
+ "[63]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.976996\tvalid_0's ndcg@4: 0.977276\tvalid_0's ndcg@5: 0.977315\n",
+ "[64]\tvalid_0's ndcg@1: 0.940375\tvalid_0's ndcg@2: 0.975723\tvalid_0's ndcg@3: 0.977123\tvalid_0's ndcg@4: 0.977392\tvalid_0's ndcg@5: 0.977431\n",
+ "[65]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977154\tvalid_0's ndcg@4: 0.977423\tvalid_0's ndcg@5: 0.977462\n",
+ "[66]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975744\tvalid_0's ndcg@3: 0.977156\tvalid_0's ndcg@4: 0.977426\tvalid_0's ndcg@5: 0.977464\n",
+ "[67]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.97576\tvalid_0's ndcg@3: 0.977172\tvalid_0's ndcg@4: 0.977431\tvalid_0's ndcg@5: 0.977469\n",
+ "[68]\tvalid_0's ndcg@1: 0.940675\tvalid_0's ndcg@2: 0.975849\tvalid_0's ndcg@3: 0.977249\tvalid_0's ndcg@4: 0.977508\tvalid_0's ndcg@5: 0.977546\n",
+ "[69]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.976017\tvalid_0's ndcg@3: 0.977454\tvalid_0's ndcg@4: 0.977724\tvalid_0's ndcg@5: 0.977762\n",
+ "[70]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.977362\tvalid_0's ndcg@4: 0.977631\tvalid_0's ndcg@5: 0.97767\n",
+ "[71]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.97735\tvalid_0's ndcg@4: 0.97763\tvalid_0's ndcg@5: 0.977668\n",
+ "[72]\tvalid_0's ndcg@1: 0.941325\tvalid_0's ndcg@2: 0.976058\tvalid_0's ndcg@3: 0.97747\tvalid_0's ndcg@4: 0.977739\tvalid_0's ndcg@5: 0.977778\n",
+ "[73]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977476\tvalid_0's ndcg@4: 0.977756\tvalid_0's ndcg@5: 0.977795\n",
+ "[74]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.97759\tvalid_0's ndcg@4: 0.97788\tvalid_0's ndcg@5: 0.977919\n",
+ "[75]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.977602\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.977921\n",
+ "[76]\tvalid_0's ndcg@1: 0.94195\tvalid_0's ndcg@2: 0.976273\tvalid_0's ndcg@3: 0.977685\tvalid_0's ndcg@4: 0.977965\tvalid_0's ndcg@5: 0.978004\n",
+ "[77]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.97627\tvalid_0's ndcg@3: 0.97767\tvalid_0's ndcg@4: 0.97795\tvalid_0's ndcg@5: 0.977989\n",
+ "[78]\tvalid_0's ndcg@1: 0.94235\tvalid_0's ndcg@2: 0.976452\tvalid_0's ndcg@3: 0.977839\tvalid_0's ndcg@4: 0.978119\tvalid_0's ndcg@5: 0.978158\n",
+ "[79]\tvalid_0's ndcg@1: 0.94265\tvalid_0's ndcg@2: 0.976562\tvalid_0's ndcg@3: 0.977937\tvalid_0's ndcg@4: 0.978228\tvalid_0's ndcg@5: 0.978267\n",
+ "[80]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976667\tvalid_0's ndcg@3: 0.978067\tvalid_0's ndcg@4: 0.978347\tvalid_0's ndcg@5: 0.978385\n",
+ "[81]\tvalid_0's ndcg@1: 0.94305\tvalid_0's ndcg@2: 0.97671\tvalid_0's ndcg@3: 0.978098\tvalid_0's ndcg@4: 0.978378\tvalid_0's ndcg@5: 0.978416\n",
+ "[82]\tvalid_0's ndcg@1: 0.943175\tvalid_0's ndcg@2: 0.97674\tvalid_0's ndcg@3: 0.978115\tvalid_0's ndcg@4: 0.978417\tvalid_0's ndcg@5: 0.978456\n",
+ "[83]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976752\tvalid_0's ndcg@3: 0.97814\tvalid_0's ndcg@4: 0.978441\tvalid_0's ndcg@5: 0.97848\n",
+ "[84]\tvalid_0's ndcg@1: 0.943375\tvalid_0's ndcg@2: 0.976767\tvalid_0's ndcg@3: 0.978179\tvalid_0's ndcg@4: 0.978481\tvalid_0's ndcg@5: 0.97852\n",
+ "[85]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976721\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978437\tvalid_0's ndcg@5: 0.978475\n",
+ "[86]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976792\tvalid_0's ndcg@3: 0.978204\tvalid_0's ndcg@4: 0.978506\tvalid_0's ndcg@5: 0.978535\n",
+ "[87]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.976851\tvalid_0's ndcg@3: 0.978239\tvalid_0's ndcg@4: 0.97854\tvalid_0's ndcg@5: 0.978569\n",
+ "[88]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976882\tvalid_0's ndcg@3: 0.978282\tvalid_0's ndcg@4: 0.978572\tvalid_0's ndcg@5: 0.978611\n",
+ "[89]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.976915\tvalid_0's ndcg@3: 0.97834\tvalid_0's ndcg@4: 0.97863\tvalid_0's ndcg@5: 0.978669\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[90]\tvalid_0's ndcg@1: 0.943925\tvalid_0's ndcg@2: 0.976986\tvalid_0's ndcg@3: 0.978398\tvalid_0's ndcg@4: 0.978689\tvalid_0's ndcg@5: 0.978728\n",
+ "[91]\tvalid_0's ndcg@1: 0.943875\tvalid_0's ndcg@2: 0.976999\tvalid_0's ndcg@3: 0.978399\tvalid_0's ndcg@4: 0.978679\tvalid_0's ndcg@5: 0.978717\n",
+ "[92]\tvalid_0's ndcg@1: 0.94395\tvalid_0's ndcg@2: 0.977058\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978711\tvalid_0's ndcg@5: 0.97876\n",
+ "[93]\tvalid_0's ndcg@1: 0.944075\tvalid_0's ndcg@2: 0.977104\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978759\tvalid_0's ndcg@5: 0.978807\n",
+ "[94]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977125\tvalid_0's ndcg@3: 0.978513\tvalid_0's ndcg@4: 0.978793\tvalid_0's ndcg@5: 0.978841\n",
+ "[95]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977153\tvalid_0's ndcg@3: 0.97854\tvalid_0's ndcg@4: 0.97882\tvalid_0's ndcg@5: 0.978869\n",
+ "[96]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977144\tvalid_0's ndcg@3: 0.978531\tvalid_0's ndcg@4: 0.978811\tvalid_0's ndcg@5: 0.97886\n",
+ "[97]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977221\tvalid_0's ndcg@3: 0.978584\tvalid_0's ndcg@4: 0.978864\tvalid_0's ndcg@5: 0.978912\n",
+ "[98]\tvalid_0's ndcg@1: 0.944575\tvalid_0's ndcg@2: 0.977289\tvalid_0's ndcg@3: 0.978651\tvalid_0's ndcg@4: 0.978942\tvalid_0's ndcg@5: 0.97899\n",
+ "[99]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977341\tvalid_0's ndcg@3: 0.978691\tvalid_0's ndcg@4: 0.978993\tvalid_0's ndcg@5: 0.979032\n",
+ "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
+ "[1]\tvalid_0's ndcg@1: 0.911575\tvalid_0's ndcg@2: 0.964384\tvalid_0's ndcg@3: 0.966321\tvalid_0's ndcg@4: 0.966623\tvalid_0's ndcg@5: 0.966671\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9136\tvalid_0's ndcg@2: 0.965257\tvalid_0's ndcg@3: 0.967107\tvalid_0's ndcg@4: 0.967398\tvalid_0's ndcg@5: 0.967456\n",
+ "[3]\tvalid_0's ndcg@1: 0.917425\tvalid_0's ndcg@2: 0.966732\tvalid_0's ndcg@3: 0.968545\tvalid_0's ndcg@4: 0.968814\tvalid_0's ndcg@5: 0.968882\n",
+ "[4]\tvalid_0's ndcg@1: 0.9222\tvalid_0's ndcg@2: 0.968558\tvalid_0's ndcg@3: 0.970383\tvalid_0's ndcg@4: 0.970619\tvalid_0's ndcg@5: 0.970668\n",
+ "[5]\tvalid_0's ndcg@1: 0.925875\tvalid_0's ndcg@2: 0.969914\tvalid_0's ndcg@3: 0.971714\tvalid_0's ndcg@4: 0.971972\tvalid_0's ndcg@5: 0.972021\n",
+ "[6]\tvalid_0's ndcg@1: 0.926875\tvalid_0's ndcg@2: 0.970425\tvalid_0's ndcg@3: 0.972112\tvalid_0's ndcg@4: 0.972371\tvalid_0's ndcg@5: 0.972419\n",
+ "[7]\tvalid_0's ndcg@1: 0.927475\tvalid_0's ndcg@2: 0.970631\tvalid_0's ndcg@3: 0.972306\tvalid_0's ndcg@4: 0.972586\tvalid_0's ndcg@5: 0.972634\n",
+ "[8]\tvalid_0's ndcg@1: 0.93015\tvalid_0's ndcg@2: 0.971649\tvalid_0's ndcg@3: 0.973287\tvalid_0's ndcg@4: 0.973567\tvalid_0's ndcg@5: 0.973625\n",
+ "[9]\tvalid_0's ndcg@1: 0.9312\tvalid_0's ndcg@2: 0.972084\tvalid_0's ndcg@3: 0.973684\tvalid_0's ndcg@4: 0.973964\tvalid_0's ndcg@5: 0.974022\n",
+ "[10]\tvalid_0's ndcg@1: 0.93225\tvalid_0's ndcg@2: 0.972456\tvalid_0's ndcg@3: 0.974081\tvalid_0's ndcg@4: 0.974361\tvalid_0's ndcg@5: 0.974409\n",
+ "[11]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972704\tvalid_0's ndcg@3: 0.974379\tvalid_0's ndcg@4: 0.974648\tvalid_0's ndcg@5: 0.974696\n",
+ "[12]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974574\tvalid_0's ndcg@4: 0.974832\tvalid_0's ndcg@5: 0.974881\n",
+ "[13]\tvalid_0's ndcg@1: 0.93415\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.97482\tvalid_0's ndcg@4: 0.975079\tvalid_0's ndcg@5: 0.975127\n",
+ "[14]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973671\tvalid_0's ndcg@3: 0.975246\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975531\n",
+ "[15]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.973877\tvalid_0's ndcg@3: 0.975452\tvalid_0's ndcg@4: 0.975699\tvalid_0's ndcg@5: 0.975748\n",
+ "[16]\tvalid_0's ndcg@1: 0.935825\tvalid_0's ndcg@2: 0.973917\tvalid_0's ndcg@3: 0.975442\tvalid_0's ndcg@4: 0.975712\tvalid_0's ndcg@5: 0.97576\n",
+ "[17]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.97411\tvalid_0's ndcg@3: 0.975697\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975995\n",
+ "[18]\tvalid_0's ndcg@1: 0.936925\tvalid_0's ndcg@2: 0.974292\tvalid_0's ndcg@3: 0.975867\tvalid_0's ndcg@4: 0.976114\tvalid_0's ndcg@5: 0.976163\n",
+ "[19]\tvalid_0's ndcg@1: 0.937525\tvalid_0's ndcg@2: 0.974545\tvalid_0's ndcg@3: 0.976095\tvalid_0's ndcg@4: 0.976342\tvalid_0's ndcg@5: 0.976391\n",
+ "[20]\tvalid_0's ndcg@1: 0.937775\tvalid_0's ndcg@2: 0.974653\tvalid_0's ndcg@3: 0.976203\tvalid_0's ndcg@4: 0.976429\tvalid_0's ndcg@5: 0.976487\n",
+ "[21]\tvalid_0's ndcg@1: 0.938825\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976597\tvalid_0's ndcg@4: 0.976823\tvalid_0's ndcg@5: 0.976881\n",
+ "[22]\tvalid_0's ndcg@1: 0.93885\tvalid_0's ndcg@2: 0.975097\tvalid_0's ndcg@3: 0.976609\tvalid_0's ndcg@4: 0.976846\tvalid_0's ndcg@5: 0.976895\n",
+ "[23]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976733\tvalid_0's ndcg@4: 0.976959\tvalid_0's ndcg@5: 0.977008\n",
+ "[24]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976721\tvalid_0's ndcg@4: 0.976947\tvalid_0's ndcg@5: 0.977005\n",
+ "[25]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975421\tvalid_0's ndcg@3: 0.976909\tvalid_0's ndcg@4: 0.977124\tvalid_0's ndcg@5: 0.977182\n",
+ "[26]\tvalid_0's ndcg@1: 0.9393\tvalid_0's ndcg@2: 0.975342\tvalid_0's ndcg@3: 0.976804\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977078\n",
+ "[27]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975323\tvalid_0's ndcg@3: 0.976798\tvalid_0's ndcg@4: 0.977014\tvalid_0's ndcg@5: 0.977062\n",
+ "[28]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975308\tvalid_0's ndcg@3: 0.976783\tvalid_0's ndcg@4: 0.977009\tvalid_0's ndcg@5: 0.977057\n",
+ "[29]\tvalid_0's ndcg@1: 0.94\tvalid_0's ndcg@2: 0.975569\tvalid_0's ndcg@3: 0.977056\tvalid_0's ndcg@4: 0.977282\tvalid_0's ndcg@5: 0.977331\n",
+ "[30]\tvalid_0's ndcg@1: 0.940325\tvalid_0's ndcg@2: 0.975673\tvalid_0's ndcg@3: 0.977173\tvalid_0's ndcg@4: 0.977399\tvalid_0's ndcg@5: 0.977447\n",
+ "[31]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975731\tvalid_0's ndcg@3: 0.977243\tvalid_0's ndcg@4: 0.977469\tvalid_0's ndcg@5: 0.977518\n",
+ "[32]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
+ "[33]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977241\tvalid_0's ndcg@4: 0.977457\tvalid_0's ndcg@5: 0.977505\n",
+ "[34]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
+ "[35]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975868\tvalid_0's ndcg@3: 0.977343\tvalid_0's ndcg@4: 0.977558\tvalid_0's ndcg@5: 0.977606\n",
+ "[36]\tvalid_0's ndcg@1: 0.94115\tvalid_0's ndcg@2: 0.976056\tvalid_0's ndcg@3: 0.977506\tvalid_0's ndcg@4: 0.977722\tvalid_0's ndcg@5: 0.97777\n",
+ "[37]\tvalid_0's ndcg@1: 0.9414\tvalid_0's ndcg@2: 0.976133\tvalid_0's ndcg@3: 0.977595\tvalid_0's ndcg@4: 0.977811\tvalid_0's ndcg@5: 0.977859\n",
+ "[38]\tvalid_0's ndcg@1: 0.94175\tvalid_0's ndcg@2: 0.976278\tvalid_0's ndcg@3: 0.977715\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.97799\n",
+ "[39]\tvalid_0's ndcg@1: 0.942075\tvalid_0's ndcg@2: 0.976366\tvalid_0's ndcg@3: 0.977841\tvalid_0's ndcg@4: 0.978056\tvalid_0's ndcg@5: 0.978105\n",
+ "[40]\tvalid_0's ndcg@1: 0.94215\tvalid_0's ndcg@2: 0.976409\tvalid_0's ndcg@3: 0.977872\tvalid_0's ndcg@4: 0.978087\tvalid_0's ndcg@5: 0.978136\n",
+ "[41]\tvalid_0's ndcg@1: 0.94245\tvalid_0's ndcg@2: 0.97652\tvalid_0's ndcg@3: 0.977983\tvalid_0's ndcg@4: 0.978198\tvalid_0's ndcg@5: 0.978246\n",
+ "[42]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
+ "[43]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
+ "[44]\tvalid_0's ndcg@1: 0.94285\tvalid_0's ndcg@2: 0.976636\tvalid_0's ndcg@3: 0.978111\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978386\n",
+ "[45]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.9768\tvalid_0's ndcg@3: 0.978262\tvalid_0's ndcg@4: 0.978488\tvalid_0's ndcg@5: 0.978537\n",
+ "[46]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
+ "[47]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97836\tvalid_0's ndcg@4: 0.978576\tvalid_0's ndcg@5: 0.978634\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[48]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.978373\tvalid_0's ndcg@4: 0.978577\tvalid_0's ndcg@5: 0.978636\n",
+ "[49]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
+ "[50]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97702\tvalid_0's ndcg@3: 0.97852\tvalid_0's ndcg@4: 0.978746\tvalid_0's ndcg@5: 0.978794\n",
+ "[51]\tvalid_0's ndcg@1: 0.9441\tvalid_0's ndcg@2: 0.97705\tvalid_0's ndcg@3: 0.97855\tvalid_0's ndcg@4: 0.978787\tvalid_0's ndcg@5: 0.978836\n",
+ "[52]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977121\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978846\tvalid_0's ndcg@5: 0.978894\n",
+ "[53]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977081\tvalid_0's ndcg@3: 0.978618\tvalid_0's ndcg@4: 0.978834\tvalid_0's ndcg@5: 0.978882\n",
+ "[54]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977071\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978824\tvalid_0's ndcg@5: 0.978873\n",
+ "[55]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977143\tvalid_0's ndcg@3: 0.978668\tvalid_0's ndcg@4: 0.978883\tvalid_0's ndcg@5: 0.978931\n",
+ "[56]\tvalid_0's ndcg@1: 0.9444\tvalid_0's ndcg@2: 0.977177\tvalid_0's ndcg@3: 0.978702\tvalid_0's ndcg@4: 0.978906\tvalid_0's ndcg@5: 0.978955\n",
+ "[57]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977263\tvalid_0's ndcg@3: 0.978788\tvalid_0's ndcg@4: 0.979003\tvalid_0's ndcg@5: 0.979051\n",
+ "[58]\tvalid_0's ndcg@1: 0.9448\tvalid_0's ndcg@2: 0.977293\tvalid_0's ndcg@3: 0.978843\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979096\n",
+ "[59]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977472\tvalid_0's ndcg@3: 0.978997\tvalid_0's ndcg@4: 0.979202\tvalid_0's ndcg@5: 0.97925\n",
+ "[60]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.97763\tvalid_0's ndcg@3: 0.979118\tvalid_0's ndcg@4: 0.979322\tvalid_0's ndcg@5: 0.979371\n",
+ "[61]\tvalid_0's ndcg@1: 0.945725\tvalid_0's ndcg@2: 0.977682\tvalid_0's ndcg@3: 0.979194\tvalid_0's ndcg@4: 0.979399\tvalid_0's ndcg@5: 0.979447\n",
+ "[62]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977812\tvalid_0's ndcg@3: 0.979312\tvalid_0's ndcg@4: 0.979495\tvalid_0's ndcg@5: 0.979543\n",
+ "[63]\tvalid_0's ndcg@1: 0.946\tvalid_0's ndcg@2: 0.977878\tvalid_0's ndcg@3: 0.97934\tvalid_0's ndcg@4: 0.979523\tvalid_0's ndcg@5: 0.979572\n",
+ "[64]\tvalid_0's ndcg@1: 0.946525\tvalid_0's ndcg@2: 0.978056\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979714\tvalid_0's ndcg@5: 0.979762\n",
+ "[65]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.978105\tvalid_0's ndcg@3: 0.979592\tvalid_0's ndcg@4: 0.979775\tvalid_0's ndcg@5: 0.979823\n",
+ "[66]\tvalid_0's ndcg@1: 0.9465\tvalid_0's ndcg@2: 0.978046\tvalid_0's ndcg@3: 0.979534\tvalid_0's ndcg@4: 0.979706\tvalid_0's ndcg@5: 0.979755\n",
+ "[67]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.978127\tvalid_0's ndcg@3: 0.979614\tvalid_0's ndcg@4: 0.979776\tvalid_0's ndcg@5: 0.979824\n",
+ "[68]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.979608\tvalid_0's ndcg@4: 0.97978\tvalid_0's ndcg@5: 0.979828\n",
+ "[69]\tvalid_0's ndcg@1: 0.946875\tvalid_0's ndcg@2: 0.978216\tvalid_0's ndcg@3: 0.979679\tvalid_0's ndcg@4: 0.979851\tvalid_0's ndcg@5: 0.9799\n",
+ "[70]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.978194\tvalid_0's ndcg@3: 0.979682\tvalid_0's ndcg@4: 0.979854\tvalid_0's ndcg@5: 0.979902\n",
+ "[71]\tvalid_0's ndcg@1: 0.947025\tvalid_0's ndcg@2: 0.978209\tvalid_0's ndcg@3: 0.979721\tvalid_0's ndcg@4: 0.979893\tvalid_0's ndcg@5: 0.979942\n",
+ "[72]\tvalid_0's ndcg@1: 0.9472\tvalid_0's ndcg@2: 0.978273\tvalid_0's ndcg@3: 0.979773\tvalid_0's ndcg@4: 0.979956\tvalid_0's ndcg@5: 0.980005\n",
+ "[73]\tvalid_0's ndcg@1: 0.947475\tvalid_0's ndcg@2: 0.978391\tvalid_0's ndcg@3: 0.979878\tvalid_0's ndcg@4: 0.980061\tvalid_0's ndcg@5: 0.980109\n",
+ "[74]\tvalid_0's ndcg@1: 0.94715\tvalid_0's ndcg@2: 0.978271\tvalid_0's ndcg@3: 0.979758\tvalid_0's ndcg@4: 0.979941\tvalid_0's ndcg@5: 0.97999\n",
+ "[75]\tvalid_0's ndcg@1: 0.947275\tvalid_0's ndcg@2: 0.978333\tvalid_0's ndcg@3: 0.979808\tvalid_0's ndcg@4: 0.979991\tvalid_0's ndcg@5: 0.980039\n",
+ "[76]\tvalid_0's ndcg@1: 0.9474\tvalid_0's ndcg@2: 0.97841\tvalid_0's ndcg@3: 0.979873\tvalid_0's ndcg@4: 0.980045\tvalid_0's ndcg@5: 0.980093\n",
+ "[77]\tvalid_0's ndcg@1: 0.94745\tvalid_0's ndcg@2: 0.97846\tvalid_0's ndcg@3: 0.979898\tvalid_0's ndcg@4: 0.98007\tvalid_0's ndcg@5: 0.980118\n",
+ "[78]\tvalid_0's ndcg@1: 0.94775\tvalid_0's ndcg@2: 0.978555\tvalid_0's ndcg@3: 0.980005\tvalid_0's ndcg@4: 0.980177\tvalid_0's ndcg@5: 0.980226\n",
+ "[79]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
+ "[80]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
+ "[81]\tvalid_0's ndcg@1: 0.948175\tvalid_0's ndcg@2: 0.978744\tvalid_0's ndcg@3: 0.980169\tvalid_0's ndcg@4: 0.980352\tvalid_0's ndcg@5: 0.98039\n",
+ "[82]\tvalid_0's ndcg@1: 0.948375\tvalid_0's ndcg@2: 0.97888\tvalid_0's ndcg@3: 0.980255\tvalid_0's ndcg@4: 0.980438\tvalid_0's ndcg@5: 0.980477\n",
+ "[83]\tvalid_0's ndcg@1: 0.94825\tvalid_0's ndcg@2: 0.978834\tvalid_0's ndcg@3: 0.980209\tvalid_0's ndcg@4: 0.980392\tvalid_0's ndcg@5: 0.980431\n",
+ "[84]\tvalid_0's ndcg@1: 0.948275\tvalid_0's ndcg@2: 0.978844\tvalid_0's ndcg@3: 0.980219\tvalid_0's ndcg@4: 0.980402\tvalid_0's ndcg@5: 0.98044\n",
+ "[85]\tvalid_0's ndcg@1: 0.948475\tvalid_0's ndcg@2: 0.978917\tvalid_0's ndcg@3: 0.980292\tvalid_0's ndcg@4: 0.980475\tvalid_0's ndcg@5: 0.980514\n",
+ "[86]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979102\tvalid_0's ndcg@3: 0.980477\tvalid_0's ndcg@4: 0.98066\tvalid_0's ndcg@5: 0.980699\n",
+ "[87]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979086\tvalid_0's ndcg@3: 0.980474\tvalid_0's ndcg@4: 0.980657\tvalid_0's ndcg@5: 0.980695\n",
+ "[88]\tvalid_0's ndcg@1: 0.949025\tvalid_0's ndcg@2: 0.979136\tvalid_0's ndcg@3: 0.980499\tvalid_0's ndcg@4: 0.980682\tvalid_0's ndcg@5: 0.98072\n",
+ "[89]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979285\tvalid_0's ndcg@3: 0.98061\tvalid_0's ndcg@4: 0.980793\tvalid_0's ndcg@5: 0.980832\n",
+ "[90]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
+ "[91]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
+ "[92]\tvalid_0's ndcg@1: 0.9494\tvalid_0's ndcg@2: 0.97929\tvalid_0's ndcg@3: 0.98064\tvalid_0's ndcg@4: 0.980823\tvalid_0's ndcg@5: 0.980862\n",
+ "[93]\tvalid_0's ndcg@1: 0.949375\tvalid_0's ndcg@2: 0.979297\tvalid_0's ndcg@3: 0.980634\tvalid_0's ndcg@4: 0.980817\tvalid_0's ndcg@5: 0.980856\n",
+ "[94]\tvalid_0's ndcg@1: 0.949525\tvalid_0's ndcg@2: 0.979336\tvalid_0's ndcg@3: 0.980686\tvalid_0's ndcg@4: 0.980869\tvalid_0's ndcg@5: 0.980908\n",
+ "[95]\tvalid_0's ndcg@1: 0.949825\tvalid_0's ndcg@2: 0.979416\tvalid_0's ndcg@3: 0.980791\tvalid_0's ndcg@4: 0.980974\tvalid_0's ndcg@5: 0.981012\n",
+ "[96]\tvalid_0's ndcg@1: 0.94975\tvalid_0's ndcg@2: 0.979404\tvalid_0's ndcg@3: 0.980779\tvalid_0's ndcg@4: 0.980951\tvalid_0's ndcg@5: 0.98099\n",
+ "[97]\tvalid_0's ndcg@1: 0.950025\tvalid_0's ndcg@2: 0.979537\tvalid_0's ndcg@3: 0.980874\tvalid_0's ndcg@4: 0.981057\tvalid_0's ndcg@5: 0.981096\n",
+ "[98]\tvalid_0's ndcg@1: 0.9501\tvalid_0's ndcg@2: 0.979564\tvalid_0's ndcg@3: 0.980889\tvalid_0's ndcg@4: 0.981083\tvalid_0's ndcg@5: 0.981122\n",
+ "[99]\tvalid_0's ndcg@1: 0.950275\tvalid_0's ndcg@2: 0.979629\tvalid_0's ndcg@3: 0.980967\tvalid_0's ndcg@4: 0.98115\tvalid_0's ndcg@5: 0.981188\n",
+ "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
+ "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.96382\tvalid_0's ndcg@3: 0.965707\tvalid_0's ndcg@4: 0.966009\tvalid_0's ndcg@5: 0.966086\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.91415\tvalid_0's ndcg@2: 0.965492\tvalid_0's ndcg@3: 0.967254\tvalid_0's ndcg@4: 0.967556\tvalid_0's ndcg@5: 0.967604\n",
+ "[3]\tvalid_0's ndcg@1: 0.916025\tvalid_0's ndcg@2: 0.966389\tvalid_0's ndcg@3: 0.967976\tvalid_0's ndcg@4: 0.968278\tvalid_0's ndcg@5: 0.968355\n",
+ "[4]\tvalid_0's ndcg@1: 0.919\tvalid_0's ndcg@2: 0.967392\tvalid_0's ndcg@3: 0.96903\tvalid_0's ndcg@4: 0.969364\tvalid_0's ndcg@5: 0.969431\n",
+ "[5]\tvalid_0's ndcg@1: 0.921125\tvalid_0's ndcg@2: 0.968192\tvalid_0's ndcg@3: 0.969855\tvalid_0's ndcg@4: 0.970156\tvalid_0's ndcg@5: 0.970224\n",
+ "[6]\tvalid_0's ndcg@1: 0.921675\tvalid_0's ndcg@2: 0.968411\tvalid_0's ndcg@3: 0.970111\tvalid_0's ndcg@4: 0.97037\tvalid_0's ndcg@5: 0.970437\n",
+ "[7]\tvalid_0's ndcg@1: 0.9237\tvalid_0's ndcg@2: 0.969332\tvalid_0's ndcg@3: 0.970882\tvalid_0's ndcg@4: 0.97113\tvalid_0's ndcg@5: 0.971217\n",
+ "[8]\tvalid_0's ndcg@1: 0.925775\tvalid_0's ndcg@2: 0.970129\tvalid_0's ndcg@3: 0.971642\tvalid_0's ndcg@4: 0.971922\tvalid_0's ndcg@5: 0.97199\n",
+ "[9]\tvalid_0's ndcg@1: 0.926775\tvalid_0's ndcg@2: 0.970435\tvalid_0's ndcg@3: 0.971985\tvalid_0's ndcg@4: 0.972276\tvalid_0's ndcg@5: 0.972334\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[10]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970761\tvalid_0's ndcg@3: 0.972311\tvalid_0's ndcg@4: 0.972612\tvalid_0's ndcg@5: 0.97267\n",
+ "[11]\tvalid_0's ndcg@1: 0.928975\tvalid_0's ndcg@2: 0.97131\tvalid_0's ndcg@3: 0.972798\tvalid_0's ndcg@4: 0.973089\tvalid_0's ndcg@5: 0.973166\n",
+ "[12]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971505\tvalid_0's ndcg@3: 0.972968\tvalid_0's ndcg@4: 0.973259\tvalid_0's ndcg@5: 0.973326\n",
+ "[13]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971426\tvalid_0's ndcg@3: 0.972939\tvalid_0's ndcg@4: 0.97324\tvalid_0's ndcg@5: 0.973318\n",
+ "[14]\tvalid_0's ndcg@1: 0.929775\tvalid_0's ndcg@2: 0.971621\tvalid_0's ndcg@3: 0.973121\tvalid_0's ndcg@4: 0.973412\tvalid_0's ndcg@5: 0.97348\n",
+ "[15]\tvalid_0's ndcg@1: 0.9304\tvalid_0's ndcg@2: 0.971868\tvalid_0's ndcg@3: 0.97338\tvalid_0's ndcg@4: 0.97365\tvalid_0's ndcg@5: 0.973717\n",
+ "[16]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.972096\tvalid_0's ndcg@3: 0.973558\tvalid_0's ndcg@4: 0.973849\tvalid_0's ndcg@5: 0.973926\n",
+ "[17]\tvalid_0's ndcg@1: 0.93105\tvalid_0's ndcg@2: 0.972108\tvalid_0's ndcg@3: 0.973583\tvalid_0's ndcg@4: 0.973884\tvalid_0's ndcg@5: 0.973952\n",
+ "[18]\tvalid_0's ndcg@1: 0.931725\tvalid_0's ndcg@2: 0.972373\tvalid_0's ndcg@3: 0.97386\tvalid_0's ndcg@4: 0.974129\tvalid_0's ndcg@5: 0.974207\n",
+ "[19]\tvalid_0's ndcg@1: 0.932175\tvalid_0's ndcg@2: 0.972681\tvalid_0's ndcg@3: 0.974068\tvalid_0's ndcg@4: 0.974348\tvalid_0's ndcg@5: 0.974406\n",
+ "[20]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.973019\tvalid_0's ndcg@3: 0.974382\tvalid_0's ndcg@4: 0.974673\tvalid_0's ndcg@5: 0.974731\n",
+ "[21]\tvalid_0's ndcg@1: 0.933075\tvalid_0's ndcg@2: 0.97306\tvalid_0's ndcg@3: 0.974423\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.97477\n",
+ "[22]\tvalid_0's ndcg@1: 0.93375\tvalid_0's ndcg@2: 0.973262\tvalid_0's ndcg@3: 0.974649\tvalid_0's ndcg@4: 0.974929\tvalid_0's ndcg@5: 0.975007\n",
+ "[23]\tvalid_0's ndcg@1: 0.933675\tvalid_0's ndcg@2: 0.973219\tvalid_0's ndcg@3: 0.974606\tvalid_0's ndcg@4: 0.974886\tvalid_0's ndcg@5: 0.974973\n",
+ "[24]\tvalid_0's ndcg@1: 0.934\tvalid_0's ndcg@2: 0.97337\tvalid_0's ndcg@3: 0.974745\tvalid_0's ndcg@4: 0.975014\tvalid_0's ndcg@5: 0.975101\n",
+ "[25]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973674\tvalid_0's ndcg@3: 0.975062\tvalid_0's ndcg@4: 0.975342\tvalid_0's ndcg@5: 0.97541\n",
+ "[26]\tvalid_0's ndcg@1: 0.93495\tvalid_0's ndcg@2: 0.973721\tvalid_0's ndcg@3: 0.975096\tvalid_0's ndcg@4: 0.975365\tvalid_0's ndcg@5: 0.975452\n",
+ "[27]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.974082\tvalid_0's ndcg@3: 0.975444\tvalid_0's ndcg@4: 0.975713\tvalid_0's ndcg@5: 0.975781\n",
+ "[28]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973875\tvalid_0's ndcg@3: 0.975275\tvalid_0's ndcg@4: 0.975512\tvalid_0's ndcg@5: 0.975599\n",
+ "[29]\tvalid_0's ndcg@1: 0.935925\tvalid_0's ndcg@2: 0.974159\tvalid_0's ndcg@3: 0.975522\tvalid_0's ndcg@4: 0.975759\tvalid_0's ndcg@5: 0.975836\n",
+ "[30]\tvalid_0's ndcg@1: 0.9362\tvalid_0's ndcg@2: 0.974214\tvalid_0's ndcg@3: 0.975589\tvalid_0's ndcg@4: 0.975847\tvalid_0's ndcg@5: 0.975924\n",
+ "[31]\tvalid_0's ndcg@1: 0.93625\tvalid_0's ndcg@2: 0.974216\tvalid_0's ndcg@3: 0.975629\tvalid_0's ndcg@4: 0.975876\tvalid_0's ndcg@5: 0.975944\n",
+ "[32]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974427\tvalid_0's ndcg@3: 0.975814\tvalid_0's ndcg@4: 0.97603\tvalid_0's ndcg@5: 0.976107\n",
+ "[33]\tvalid_0's ndcg@1: 0.936775\tvalid_0's ndcg@2: 0.974505\tvalid_0's ndcg@3: 0.975855\tvalid_0's ndcg@4: 0.976081\tvalid_0's ndcg@5: 0.976158\n",
+ "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974643\tvalid_0's ndcg@3: 0.975993\tvalid_0's ndcg@4: 0.976219\tvalid_0's ndcg@5: 0.976296\n",
+ "[35]\tvalid_0's ndcg@1: 0.937675\tvalid_0's ndcg@2: 0.974805\tvalid_0's ndcg@3: 0.97618\tvalid_0's ndcg@4: 0.976406\tvalid_0's ndcg@5: 0.976484\n",
+ "[36]\tvalid_0's ndcg@1: 0.9382\tvalid_0's ndcg@2: 0.974983\tvalid_0's ndcg@3: 0.976371\tvalid_0's ndcg@4: 0.976597\tvalid_0's ndcg@5: 0.976674\n",
+ "[37]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.974974\tvalid_0's ndcg@3: 0.976349\tvalid_0's ndcg@4: 0.976586\tvalid_0's ndcg@5: 0.976663\n",
+ "[38]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.975143\tvalid_0's ndcg@3: 0.976518\tvalid_0's ndcg@4: 0.976776\tvalid_0's ndcg@5: 0.976844\n",
+ "[39]\tvalid_0's ndcg@1: 0.938575\tvalid_0's ndcg@2: 0.975106\tvalid_0's ndcg@3: 0.976481\tvalid_0's ndcg@4: 0.976739\tvalid_0's ndcg@5: 0.976807\n",
+ "[40]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.97519\tvalid_0's ndcg@3: 0.976528\tvalid_0's ndcg@4: 0.976775\tvalid_0's ndcg@5: 0.976853\n",
+ "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975347\tvalid_0's ndcg@3: 0.976697\tvalid_0's ndcg@4: 0.976934\tvalid_0's ndcg@5: 0.977001\n",
+ "[42]\tvalid_0's ndcg@1: 0.939825\tvalid_0's ndcg@2: 0.975599\tvalid_0's ndcg@3: 0.976961\tvalid_0's ndcg@4: 0.977198\tvalid_0's ndcg@5: 0.977266\n",
+ "[43]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975639\tvalid_0's ndcg@3: 0.976977\tvalid_0's ndcg@4: 0.977214\tvalid_0's ndcg@5: 0.977282\n",
+ "[44]\tvalid_0's ndcg@1: 0.9398\tvalid_0's ndcg@2: 0.975605\tvalid_0's ndcg@3: 0.976955\tvalid_0's ndcg@4: 0.977192\tvalid_0's ndcg@5: 0.97726\n",
+ "[45]\tvalid_0's ndcg@1: 0.9401\tvalid_0's ndcg@2: 0.9757\tvalid_0's ndcg@3: 0.977075\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
+ "[46]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975845\tvalid_0's ndcg@3: 0.977183\tvalid_0's ndcg@4: 0.97742\tvalid_0's ndcg@5: 0.977497\n",
+ "[47]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975854\tvalid_0's ndcg@3: 0.977204\tvalid_0's ndcg@4: 0.97743\tvalid_0's ndcg@5: 0.977508\n",
+ "[48]\tvalid_0's ndcg@1: 0.940575\tvalid_0's ndcg@2: 0.975923\tvalid_0's ndcg@3: 0.977273\tvalid_0's ndcg@4: 0.977488\tvalid_0's ndcg@5: 0.977556\n",
+ "[49]\tvalid_0's ndcg@1: 0.9407\tvalid_0's ndcg@2: 0.975922\tvalid_0's ndcg@3: 0.977297\tvalid_0's ndcg@4: 0.977501\tvalid_0's ndcg@5: 0.977588\n",
+ "[50]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977322\tvalid_0's ndcg@4: 0.977505\tvalid_0's ndcg@5: 0.977592\n",
+ "[51]\tvalid_0's ndcg@1: 0.9406\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.97725\tvalid_0's ndcg@4: 0.977422\tvalid_0's ndcg@5: 0.977509\n",
+ "[52]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975997\tvalid_0's ndcg@3: 0.977422\tvalid_0's ndcg@4: 0.977594\tvalid_0's ndcg@5: 0.977691\n",
+ "[53]\tvalid_0's ndcg@1: 0.940925\tvalid_0's ndcg@2: 0.975989\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977538\tvalid_0's ndcg@5: 0.977644\n",
+ "[54]\tvalid_0's ndcg@1: 0.94125\tvalid_0's ndcg@2: 0.976062\tvalid_0's ndcg@3: 0.977487\tvalid_0's ndcg@4: 0.977659\tvalid_0's ndcg@5: 0.977756\n",
+ "[55]\tvalid_0's ndcg@1: 0.94145\tvalid_0's ndcg@2: 0.976183\tvalid_0's ndcg@3: 0.97757\tvalid_0's ndcg@4: 0.977742\tvalid_0's ndcg@5: 0.977839\n",
+ "[56]\tvalid_0's ndcg@1: 0.941475\tvalid_0's ndcg@2: 0.976176\tvalid_0's ndcg@3: 0.977576\tvalid_0's ndcg@4: 0.977748\tvalid_0's ndcg@5: 0.977845\n",
+ "[57]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976139\tvalid_0's ndcg@3: 0.977539\tvalid_0's ndcg@4: 0.977712\tvalid_0's ndcg@5: 0.977808\n",
+ "[58]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.97625\tvalid_0's ndcg@3: 0.97765\tvalid_0's ndcg@4: 0.977822\tvalid_0's ndcg@5: 0.977919\n",
+ "[59]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.976253\tvalid_0's ndcg@3: 0.977653\tvalid_0's ndcg@4: 0.977836\tvalid_0's ndcg@5: 0.977932\n",
+ "[60]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977634\tvalid_0's ndcg@4: 0.977817\tvalid_0's ndcg@5: 0.977914\n",
+ "[61]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.976333\tvalid_0's ndcg@3: 0.977745\tvalid_0's ndcg@4: 0.977918\tvalid_0's ndcg@5: 0.978005\n",
+ "[62]\tvalid_0's ndcg@1: 0.941975\tvalid_0's ndcg@2: 0.976345\tvalid_0's ndcg@3: 0.977757\tvalid_0's ndcg@4: 0.97794\tvalid_0's ndcg@5: 0.978027\n",
+ "[63]\tvalid_0's ndcg@1: 0.9423\tvalid_0's ndcg@2: 0.976496\tvalid_0's ndcg@3: 0.977871\tvalid_0's ndcg@4: 0.978065\tvalid_0's ndcg@5: 0.978152\n",
+ "[64]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976632\tvalid_0's ndcg@3: 0.977995\tvalid_0's ndcg@4: 0.978188\tvalid_0's ndcg@5: 0.978275\n",
+ "[65]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976629\tvalid_0's ndcg@3: 0.977979\tvalid_0's ndcg@4: 0.978173\tvalid_0's ndcg@5: 0.97826\n",
+ "[66]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976685\tvalid_0's ndcg@3: 0.978035\tvalid_0's ndcg@4: 0.978229\tvalid_0's ndcg@5: 0.978316\n",
+ "[67]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976678\tvalid_0's ndcg@3: 0.978041\tvalid_0's ndcg@4: 0.978224\tvalid_0's ndcg@5: 0.97832\n",
+ "[68]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976694\tvalid_0's ndcg@3: 0.978044\tvalid_0's ndcg@4: 0.978227\tvalid_0's ndcg@5: 0.978324\n",
+ "[69]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976834\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978329\tvalid_0's ndcg@5: 0.978426\n",
+ "[70]\tvalid_0's ndcg@1: 0.943025\tvalid_0's ndcg@2: 0.976827\tvalid_0's ndcg@3: 0.978152\tvalid_0's ndcg@4: 0.978324\tvalid_0's ndcg@5: 0.978431\n",
+ "[71]\tvalid_0's ndcg@1: 0.9432\tvalid_0's ndcg@2: 0.976923\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978504\n",
+ "[72]\tvalid_0's ndcg@1: 0.943225\tvalid_0's ndcg@2: 0.976917\tvalid_0's ndcg@3: 0.978254\tvalid_0's ndcg@4: 0.978405\tvalid_0's ndcg@5: 0.978511\n",
+ "[73]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976936\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978409\tvalid_0's ndcg@5: 0.978496\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[74]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976957\tvalid_0's ndcg@3: 0.97827\tvalid_0's ndcg@4: 0.978431\tvalid_0's ndcg@5: 0.978528\n",
+ "[75]\tvalid_0's ndcg@1: 0.943075\tvalid_0's ndcg@2: 0.976861\tvalid_0's ndcg@3: 0.978199\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978457\n",
+ "[76]\tvalid_0's ndcg@1: 0.94335\tvalid_0's ndcg@2: 0.976963\tvalid_0's ndcg@3: 0.978288\tvalid_0's ndcg@4: 0.978471\tvalid_0's ndcg@5: 0.978568\n",
+ "[77]\tvalid_0's ndcg@1: 0.94345\tvalid_0's ndcg@2: 0.977031\tvalid_0's ndcg@3: 0.978331\tvalid_0's ndcg@4: 0.978514\tvalid_0's ndcg@5: 0.978611\n",
+ "[78]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.977088\tvalid_0's ndcg@3: 0.97835\tvalid_0's ndcg@4: 0.978533\tvalid_0's ndcg@5: 0.97863\n",
+ "[79]\tvalid_0's ndcg@1: 0.943625\tvalid_0's ndcg@2: 0.977096\tvalid_0's ndcg@3: 0.978396\tvalid_0's ndcg@4: 0.978579\tvalid_0's ndcg@5: 0.978676\n",
+ "[80]\tvalid_0's ndcg@1: 0.943825\tvalid_0's ndcg@2: 0.977154\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978651\tvalid_0's ndcg@5: 0.978748\n",
+ "[81]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.97846\tvalid_0's ndcg@4: 0.978633\tvalid_0's ndcg@5: 0.978729\n",
+ "[82]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.977361\tvalid_0's ndcg@3: 0.978673\tvalid_0's ndcg@4: 0.978845\tvalid_0's ndcg@5: 0.978933\n",
+ "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977324\tvalid_0's ndcg@3: 0.978624\tvalid_0's ndcg@4: 0.978796\tvalid_0's ndcg@5: 0.978893\n",
+ "[84]\tvalid_0's ndcg@1: 0.94405\tvalid_0's ndcg@2: 0.977253\tvalid_0's ndcg@3: 0.978565\tvalid_0's ndcg@4: 0.978737\tvalid_0's ndcg@5: 0.978834\n",
+ "[85]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977283\tvalid_0's ndcg@3: 0.978633\tvalid_0's ndcg@4: 0.978795\tvalid_0's ndcg@5: 0.978882\n",
+ "[86]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.97745\tvalid_0's ndcg@3: 0.978763\tvalid_0's ndcg@4: 0.978924\tvalid_0's ndcg@5: 0.979011\n",
+ "[87]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.977419\tvalid_0's ndcg@3: 0.978756\tvalid_0's ndcg@4: 0.978918\tvalid_0's ndcg@5: 0.979005\n",
+ "[88]\tvalid_0's ndcg@1: 0.944825\tvalid_0's ndcg@2: 0.977554\tvalid_0's ndcg@3: 0.978867\tvalid_0's ndcg@4: 0.979039\tvalid_0's ndcg@5: 0.979126\n",
+ "[89]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977767\tvalid_0's ndcg@3: 0.979079\tvalid_0's ndcg@4: 0.979262\tvalid_0's ndcg@5: 0.97934\n",
+ "[90]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977773\tvalid_0's ndcg@3: 0.979073\tvalid_0's ndcg@4: 0.979256\tvalid_0's ndcg@5: 0.979334\n",
+ "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977792\tvalid_0's ndcg@3: 0.979092\tvalid_0's ndcg@4: 0.979275\tvalid_0's ndcg@5: 0.979352\n",
+ "[92]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977776\tvalid_0's ndcg@3: 0.979088\tvalid_0's ndcg@4: 0.979261\tvalid_0's ndcg@5: 0.979348\n",
+ "[93]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977757\tvalid_0's ndcg@3: 0.979082\tvalid_0's ndcg@4: 0.979244\tvalid_0's ndcg@5: 0.979331\n",
+ "[94]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977761\tvalid_0's ndcg@3: 0.979061\tvalid_0's ndcg@4: 0.979223\tvalid_0's ndcg@5: 0.97931\n",
+ "[95]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977798\tvalid_0's ndcg@3: 0.979086\tvalid_0's ndcg@4: 0.979258\tvalid_0's ndcg@5: 0.979345\n",
+ "[96]\tvalid_0's ndcg@1: 0.945825\tvalid_0's ndcg@2: 0.977955\tvalid_0's ndcg@3: 0.97923\tvalid_0's ndcg@4: 0.979413\tvalid_0's ndcg@5: 0.9795\n",
+ "[97]\tvalid_0's ndcg@1: 0.945925\tvalid_0's ndcg@2: 0.97796\tvalid_0's ndcg@3: 0.97926\tvalid_0's ndcg@4: 0.979443\tvalid_0's ndcg@5: 0.979531\n",
+ "[98]\tvalid_0's ndcg@1: 0.9464\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.97942\tvalid_0's ndcg@4: 0.979625\tvalid_0's ndcg@5: 0.979702\n",
+ "[99]\tvalid_0's ndcg@1: 0.94655\tvalid_0's ndcg@2: 0.978191\tvalid_0's ndcg@3: 0.979479\tvalid_0's ndcg@4: 0.979683\tvalid_0's ndcg@5: 0.97977\n",
+ "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
+ "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.963031\tvalid_0's ndcg@3: 0.965281\tvalid_0's ndcg@4: 0.965819\tvalid_0's ndcg@5: 0.965887\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9141\tvalid_0's ndcg@2: 0.964748\tvalid_0's ndcg@3: 0.96681\tvalid_0's ndcg@4: 0.967316\tvalid_0's ndcg@5: 0.967394\n",
+ "[3]\tvalid_0's ndcg@1: 0.915925\tvalid_0's ndcg@2: 0.9655\tvalid_0's ndcg@3: 0.967575\tvalid_0's ndcg@4: 0.968028\tvalid_0's ndcg@5: 0.968105\n",
+ "[4]\tvalid_0's ndcg@1: 0.91915\tvalid_0's ndcg@2: 0.966943\tvalid_0's ndcg@3: 0.968968\tvalid_0's ndcg@4: 0.969334\tvalid_0's ndcg@5: 0.969373\n",
+ "[5]\tvalid_0's ndcg@1: 0.920625\tvalid_0's ndcg@2: 0.967598\tvalid_0's ndcg@3: 0.969498\tvalid_0's ndcg@4: 0.969896\tvalid_0's ndcg@5: 0.969944\n",
+ "[6]\tvalid_0's ndcg@1: 0.922625\tvalid_0's ndcg@2: 0.968336\tvalid_0's ndcg@3: 0.970261\tvalid_0's ndcg@4: 0.970659\tvalid_0's ndcg@5: 0.970688\n",
+ "[7]\tvalid_0's ndcg@1: 0.923625\tvalid_0's ndcg@2: 0.968768\tvalid_0's ndcg@3: 0.970656\tvalid_0's ndcg@4: 0.971043\tvalid_0's ndcg@5: 0.971072\n",
+ "[8]\tvalid_0's ndcg@1: 0.925825\tvalid_0's ndcg@2: 0.969612\tvalid_0's ndcg@3: 0.971462\tvalid_0's ndcg@4: 0.97186\tvalid_0's ndcg@5: 0.971879\n",
+ "[9]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.969899\tvalid_0's ndcg@3: 0.971711\tvalid_0's ndcg@4: 0.97211\tvalid_0's ndcg@5: 0.972129\n",
+ "[10]\tvalid_0's ndcg@1: 0.927775\tvalid_0's ndcg@2: 0.97041\tvalid_0's ndcg@3: 0.972185\tvalid_0's ndcg@4: 0.972594\tvalid_0's ndcg@5: 0.972614\n",
+ "[11]\tvalid_0's ndcg@1: 0.92885\tvalid_0's ndcg@2: 0.970838\tvalid_0's ndcg@3: 0.972588\tvalid_0's ndcg@4: 0.973008\tvalid_0's ndcg@5: 0.973028\n",
+ "[12]\tvalid_0's ndcg@1: 0.930325\tvalid_0's ndcg@2: 0.971367\tvalid_0's ndcg@3: 0.973129\tvalid_0's ndcg@4: 0.973549\tvalid_0's ndcg@5: 0.973569\n",
+ "[13]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971631\tvalid_0's ndcg@3: 0.973443\tvalid_0's ndcg@4: 0.973842\tvalid_0's ndcg@5: 0.973871\n",
+ "[14]\tvalid_0's ndcg@1: 0.931525\tvalid_0's ndcg@2: 0.971778\tvalid_0's ndcg@3: 0.973616\tvalid_0's ndcg@4: 0.973993\tvalid_0's ndcg@5: 0.974022\n",
+ "[15]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.9717\tvalid_0's ndcg@3: 0.973475\tvalid_0's ndcg@4: 0.973852\tvalid_0's ndcg@5: 0.973872\n",
+ "[16]\tvalid_0's ndcg@1: 0.931775\tvalid_0's ndcg@2: 0.971902\tvalid_0's ndcg@3: 0.973702\tvalid_0's ndcg@4: 0.97409\tvalid_0's ndcg@5: 0.974109\n",
+ "[17]\tvalid_0's ndcg@1: 0.931425\tvalid_0's ndcg@2: 0.971805\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973967\tvalid_0's ndcg@5: 0.973986\n",
+ "[18]\tvalid_0's ndcg@1: 0.931575\tvalid_0's ndcg@2: 0.971876\tvalid_0's ndcg@3: 0.973651\tvalid_0's ndcg@4: 0.974027\tvalid_0's ndcg@5: 0.974047\n",
+ "[19]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.97208\tvalid_0's ndcg@3: 0.973805\tvalid_0's ndcg@4: 0.974192\tvalid_0's ndcg@5: 0.974212\n",
+ "[20]\tvalid_0's ndcg@1: 0.932075\tvalid_0's ndcg@2: 0.972092\tvalid_0's ndcg@3: 0.973829\tvalid_0's ndcg@4: 0.974217\tvalid_0's ndcg@5: 0.974236\n",
+ "[21]\tvalid_0's ndcg@1: 0.932675\tvalid_0's ndcg@2: 0.972282\tvalid_0's ndcg@3: 0.974057\tvalid_0's ndcg@4: 0.974444\tvalid_0's ndcg@5: 0.974454\n",
+ "[22]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972358\tvalid_0's ndcg@3: 0.974146\tvalid_0's ndcg@4: 0.974533\tvalid_0's ndcg@5: 0.974543\n",
+ "[23]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972478\tvalid_0's ndcg@3: 0.974253\tvalid_0's ndcg@4: 0.974651\tvalid_0's ndcg@5: 0.974661\n",
+ "[24]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972539\tvalid_0's ndcg@3: 0.974351\tvalid_0's ndcg@4: 0.974739\tvalid_0's ndcg@5: 0.974749\n",
+ "[25]\tvalid_0's ndcg@1: 0.93475\tvalid_0's ndcg@2: 0.973\tvalid_0's ndcg@3: 0.974788\tvalid_0's ndcg@4: 0.975197\tvalid_0's ndcg@5: 0.975206\n",
+ "[26]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.97312\tvalid_0's ndcg@3: 0.974895\tvalid_0's ndcg@4: 0.975315\tvalid_0's ndcg@5: 0.975325\n",
+ "[27]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.974865\tvalid_0's ndcg@4: 0.975264\tvalid_0's ndcg@5: 0.975273\n",
+ "[28]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974939\tvalid_0's ndcg@4: 0.975327\tvalid_0's ndcg@5: 0.975336\n",
+ "[29]\tvalid_0's ndcg@1: 0.935475\tvalid_0's ndcg@2: 0.973315\tvalid_0's ndcg@3: 0.975128\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975492\n",
+ "[30]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973522\tvalid_0's ndcg@3: 0.975297\tvalid_0's ndcg@4: 0.975663\tvalid_0's ndcg@5: 0.975673\n",
+ "[31]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973506\tvalid_0's ndcg@3: 0.975281\tvalid_0's ndcg@4: 0.975658\tvalid_0's ndcg@5: 0.975668\n",
+ "[32]\tvalid_0's ndcg@1: 0.93675\tvalid_0's ndcg@2: 0.973833\tvalid_0's ndcg@3: 0.975595\tvalid_0's ndcg@4: 0.975961\tvalid_0's ndcg@5: 0.975971\n",
+ "[33]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.973763\tvalid_0's ndcg@3: 0.975488\tvalid_0's ndcg@4: 0.975865\tvalid_0's ndcg@5: 0.975874\n",
+ "[34]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973893\tvalid_0's ndcg@3: 0.975568\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975966\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[35]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974059\tvalid_0's ndcg@3: 0.975722\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.97613\n",
+ "[36]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
+ "[37]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
+ "[38]\tvalid_0's ndcg@1: 0.938725\tvalid_0's ndcg@2: 0.974672\tvalid_0's ndcg@3: 0.97636\tvalid_0's ndcg@4: 0.976715\tvalid_0's ndcg@5: 0.976725\n",
+ "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974676\tvalid_0's ndcg@3: 0.976364\tvalid_0's ndcg@4: 0.976697\tvalid_0's ndcg@5: 0.976707\n",
+ "[40]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.974867\tvalid_0's ndcg@3: 0.97653\tvalid_0's ndcg@4: 0.976874\tvalid_0's ndcg@5: 0.976884\n",
+ "[41]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975042\tvalid_0's ndcg@3: 0.976705\tvalid_0's ndcg@4: 0.97705\tvalid_0's ndcg@5: 0.977059\n",
+ "[42]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976784\tvalid_0's ndcg@4: 0.977129\tvalid_0's ndcg@5: 0.977138\n",
+ "[43]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.97517\tvalid_0's ndcg@3: 0.97687\tvalid_0's ndcg@4: 0.977215\tvalid_0's ndcg@5: 0.977225\n",
+ "[44]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.97534\tvalid_0's ndcg@3: 0.977015\tvalid_0's ndcg@4: 0.97736\tvalid_0's ndcg@5: 0.97737\n",
+ "[45]\tvalid_0's ndcg@1: 0.94055\tvalid_0's ndcg@2: 0.975409\tvalid_0's ndcg@3: 0.977059\tvalid_0's ndcg@4: 0.977403\tvalid_0's ndcg@5: 0.977413\n",
+ "[46]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975415\tvalid_0's ndcg@3: 0.97704\tvalid_0's ndcg@4: 0.977396\tvalid_0's ndcg@5: 0.977405\n",
+ "[47]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975363\tvalid_0's ndcg@3: 0.977013\tvalid_0's ndcg@4: 0.977357\tvalid_0's ndcg@5: 0.977367\n",
+ "[48]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975388\tvalid_0's ndcg@3: 0.977025\tvalid_0's ndcg@4: 0.97737\tvalid_0's ndcg@5: 0.977379\n",
+ "[49]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975447\tvalid_0's ndcg@3: 0.977097\tvalid_0's ndcg@4: 0.977409\tvalid_0's ndcg@5: 0.977419\n",
+ "[50]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975666\tvalid_0's ndcg@3: 0.977303\tvalid_0's ndcg@4: 0.977615\tvalid_0's ndcg@5: 0.977625\n",
+ "[51]\tvalid_0's ndcg@1: 0.94135\tvalid_0's ndcg@2: 0.975751\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.97771\tvalid_0's ndcg@5: 0.97772\n",
+ "[52]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.975717\tvalid_0's ndcg@3: 0.977355\tvalid_0's ndcg@4: 0.977688\tvalid_0's ndcg@5: 0.977698\n",
+ "[53]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.975713\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977699\tvalid_0's ndcg@5: 0.977718\n",
+ "[54]\tvalid_0's ndcg@1: 0.94185\tvalid_0's ndcg@2: 0.975857\tvalid_0's ndcg@3: 0.977557\tvalid_0's ndcg@4: 0.977869\tvalid_0's ndcg@5: 0.977889\n",
+ "[55]\tvalid_0's ndcg@1: 0.941925\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.9776\tvalid_0's ndcg@4: 0.977891\tvalid_0's ndcg@5: 0.97791\n",
+ "[56]\tvalid_0's ndcg@1: 0.942325\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977719\tvalid_0's ndcg@4: 0.978032\tvalid_0's ndcg@5: 0.978051\n",
+ "[57]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977772\tvalid_0's ndcg@4: 0.978073\tvalid_0's ndcg@5: 0.978093\n",
+ "[58]\tvalid_0's ndcg@1: 0.9425\tvalid_0's ndcg@2: 0.976081\tvalid_0's ndcg@3: 0.977806\tvalid_0's ndcg@4: 0.978108\tvalid_0's ndcg@5: 0.978127\n",
+ "[59]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977788\tvalid_0's ndcg@4: 0.978079\tvalid_0's ndcg@5: 0.978098\n",
+ "[60]\tvalid_0's ndcg@1: 0.942375\tvalid_0's ndcg@2: 0.976067\tvalid_0's ndcg@3: 0.977779\tvalid_0's ndcg@4: 0.97807\tvalid_0's ndcg@5: 0.978089\n",
+ "[61]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976043\tvalid_0's ndcg@3: 0.97773\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.97804\n",
+ "[62]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976117\tvalid_0's ndcg@3: 0.977792\tvalid_0's ndcg@4: 0.978093\tvalid_0's ndcg@5: 0.978112\n",
+ "[63]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977881\tvalid_0's ndcg@4: 0.978182\tvalid_0's ndcg@5: 0.978201\n",
+ "[64]\tvalid_0's ndcg@1: 0.942925\tvalid_0's ndcg@2: 0.976254\tvalid_0's ndcg@3: 0.977966\tvalid_0's ndcg@4: 0.978268\tvalid_0's ndcg@5: 0.978287\n",
+ "[65]\tvalid_0's ndcg@1: 0.9431\tvalid_0's ndcg@2: 0.97635\tvalid_0's ndcg@3: 0.978025\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978357\n",
+ "[66]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976445\tvalid_0's ndcg@3: 0.978132\tvalid_0's ndcg@4: 0.978445\tvalid_0's ndcg@5: 0.978464\n",
+ "[67]\tvalid_0's ndcg@1: 0.943275\tvalid_0's ndcg@2: 0.976399\tvalid_0's ndcg@3: 0.978074\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978416\n",
+ "[68]\tvalid_0's ndcg@1: 0.943325\tvalid_0's ndcg@2: 0.976401\tvalid_0's ndcg@3: 0.978089\tvalid_0's ndcg@4: 0.978412\tvalid_0's ndcg@5: 0.978431\n",
+ "[69]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976578\tvalid_0's ndcg@3: 0.97819\tvalid_0's ndcg@4: 0.978546\tvalid_0's ndcg@5: 0.978565\n",
+ "[70]\tvalid_0's ndcg@1: 0.944025\tvalid_0's ndcg@2: 0.976707\tvalid_0's ndcg@3: 0.97832\tvalid_0's ndcg@4: 0.978675\tvalid_0's ndcg@5: 0.978694\n",
+ "[71]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976772\tvalid_0's ndcg@3: 0.978384\tvalid_0's ndcg@4: 0.97874\tvalid_0's ndcg@5: 0.978759\n",
+ "[72]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978409\tvalid_0's ndcg@4: 0.978765\tvalid_0's ndcg@5: 0.978784\n",
+ "[73]\tvalid_0's ndcg@1: 0.94445\tvalid_0's ndcg@2: 0.976864\tvalid_0's ndcg@3: 0.978464\tvalid_0's ndcg@4: 0.97883\tvalid_0's ndcg@5: 0.978849\n",
+ "[74]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
+ "[75]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
+ "[76]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.97696\tvalid_0's ndcg@3: 0.978535\tvalid_0's ndcg@4: 0.978901\tvalid_0's ndcg@5: 0.978921\n",
+ "[77]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
+ "[78]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
+ "[79]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976907\tvalid_0's ndcg@3: 0.978507\tvalid_0's ndcg@4: 0.978863\tvalid_0's ndcg@5: 0.978882\n",
+ "[80]\tvalid_0's ndcg@1: 0.94455\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97851\tvalid_0's ndcg@4: 0.978865\tvalid_0's ndcg@5: 0.978885\n",
+ "[81]\tvalid_0's ndcg@1: 0.944725\tvalid_0's ndcg@2: 0.97695\tvalid_0's ndcg@3: 0.978575\tvalid_0's ndcg@4: 0.978919\tvalid_0's ndcg@5: 0.978948\n",
+ "[82]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.977103\tvalid_0's ndcg@3: 0.978765\tvalid_0's ndcg@4: 0.97911\tvalid_0's ndcg@5: 0.979129\n",
+ "[83]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977066\tvalid_0's ndcg@3: 0.978716\tvalid_0's ndcg@4: 0.979071\tvalid_0's ndcg@5: 0.97909\n",
+ "[84]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.97715\tvalid_0's ndcg@3: 0.978775\tvalid_0's ndcg@4: 0.97912\tvalid_0's ndcg@5: 0.979139\n",
+ "[85]\tvalid_0's ndcg@1: 0.945025\tvalid_0's ndcg@2: 0.977092\tvalid_0's ndcg@3: 0.978692\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979067\n",
+ "[86]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977172\tvalid_0's ndcg@3: 0.97876\tvalid_0's ndcg@4: 0.979115\tvalid_0's ndcg@5: 0.979135\n",
+ "[87]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.97879\tvalid_0's ndcg@4: 0.979156\tvalid_0's ndcg@5: 0.979166\n",
+ "[88]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.978815\tvalid_0's ndcg@4: 0.979149\tvalid_0's ndcg@5: 0.979168\n",
+ "[89]\tvalid_0's ndcg@1: 0.94555\tvalid_0's ndcg@2: 0.977333\tvalid_0's ndcg@3: 0.978933\tvalid_0's ndcg@4: 0.979267\tvalid_0's ndcg@5: 0.979277\n",
+ "[90]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977462\tvalid_0's ndcg@3: 0.979062\tvalid_0's ndcg@4: 0.979396\tvalid_0's ndcg@5: 0.979406\n",
+ "[91]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977481\tvalid_0's ndcg@3: 0.979081\tvalid_0's ndcg@4: 0.979414\tvalid_0's ndcg@5: 0.979424\n",
+ "[92]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977437\tvalid_0's ndcg@3: 0.97905\tvalid_0's ndcg@4: 0.979384\tvalid_0's ndcg@5: 0.979393\n",
+ "[93]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977421\tvalid_0's ndcg@3: 0.979046\tvalid_0's ndcg@4: 0.97938\tvalid_0's ndcg@5: 0.97939\n",
+ "[94]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977431\tvalid_0's ndcg@3: 0.979068\tvalid_0's ndcg@4: 0.979391\tvalid_0's ndcg@5: 0.979401\n",
+ "[95]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977449\tvalid_0's ndcg@3: 0.979074\tvalid_0's ndcg@4: 0.979408\tvalid_0's ndcg@5: 0.979418\n",
+ "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979127\tvalid_0's ndcg@4: 0.979461\tvalid_0's ndcg@5: 0.97947\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[97]\tvalid_0's ndcg@1: 0.946375\tvalid_0's ndcg@2: 0.977622\tvalid_0's ndcg@3: 0.979222\tvalid_0's ndcg@4: 0.979577\tvalid_0's ndcg@5: 0.979577\n",
+ "[98]\tvalid_0's ndcg@1: 0.946625\tvalid_0's ndcg@2: 0.977714\tvalid_0's ndcg@3: 0.979339\tvalid_0's ndcg@4: 0.979673\tvalid_0's ndcg@5: 0.979673\n",
+ "[99]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.977739\tvalid_0's ndcg@3: 0.979352\tvalid_0's ndcg@4: 0.979685\tvalid_0's ndcg@5: 0.979685\n",
+ "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Five-fold cross-validation; the folds are split by user\n",
+    "# This part is independent of the single train/validation run above\n",
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_rank_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id','label']]\n",
+ "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
+ "\n",
+    "# Five-fold cross-validation; save the intermediate results for stacking\n",
+ "for n_fold, valid_user in enumerate(user_set):\n",
+    "    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy() # add slide user\n",
+    "    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()\n",
+ " \n",
+    "    # Per-user group sizes for the training and validation sets\n",
+ " train_idx.sort_values(by=['user_id'], inplace=True)\n",
+ " g_train = train_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ " \n",
+ " valid_idx.sort_values(by=['user_id'], inplace=True)\n",
+ " g_val = valid_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ " \n",
+    "    # Define the model\n",
+ " lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) \n",
+    "    # Train the model\n",
+ " lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,\n",
+ " eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group= [g_val], \n",
+ " eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
+ " \n",
+    "    # Predict on the validation set\n",
+ " valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ " \n",
+    "    # Normalize the output scores\n",
+ " valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
+ " \n",
+    "    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
+ " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+    "    # Collect the validation predictions in a list to concatenate later\n",
+ " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+ " \n",
+    "    # For online submission, accumulate the test predictions of every fold and average them afterwards\n",
+    "    if not offline:\n",
+    "        sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new features produced by cross-validation on the training set\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for the later stacking step; more features could be constructed here\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])  # not in place: the row order is left unchanged\n",
+ "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new cross-validation features for the test set\n",
+ "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Epoch 1/2\n",
- "290964/290964 [==============================] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842\n",
- "Epoch 2/2\n",
- "290964/290964 [==============================] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478\n"
- ]
- }
- ],
- "source": [
- "# 模型训练\n",
- "if offline:\n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val) , batch_size=256)\n",
- "else:\n",
- " # 也可以使用上面的语句用自己采样出来的验证集\n",
- " # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)\n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:29:20.436591Z",
- "start_time": "2020-11-18T04:28:58.102057Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:52.604397Z",
+ "start_time": "2020-11-18T04:22:43.253034Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+    "# Submission from a single model\n",
+    "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_ranker')"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "500000/500000 [==============================] - 20s 39us/sample\n"
- ]
- }
- ],
- "source": [
- "# 模型预测\n",
- "tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)\n",
- "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:29:34.985535Z",
- "start_time": "2020-11-18T04:29:26.264531Z"
- }
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "submit(rank_results, topk=5, model_name='din')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-15T06:15:49.490705Z",
- "start_time": "2020-11-15T06:15:49.473794Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:38:53.760383Z",
- "start_time": "2020-11-18T04:29:51.737721Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## LGB classification model"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train on 232681 samples, validate on 58283 samples\n",
- "Epoch 1/2\n",
- "232681/232681 [==============================] - 44s 189us/sample - loss: 0.2864 - binary_crossentropy: 0.2846 - auc: 0.9008 - val_loss: 0.2830 - val_binary_crossentropy: 0.2813 - val_auc: 0.9072\n",
- "Epoch 2/2\n",
- "232681/232681 [==============================] - 44s 187us/sample - loss: 0.2832 - binary_crossentropy: 0.2816 - auc: 0.9034 - val_loss: 0.2846 - val_binary_crossentropy: 0.2830 - val_auc: 0.9053\n",
- "58283/58283 [==============================] - 2s 36us/sample\n",
- "500000/500000 [==============================] - 19s 37us/sample\n",
- "Train on 232798 samples, validate on 58166 samples\n",
- "Epoch 1/2\n",
- "232798/232798 [==============================] - 43s 184us/sample - loss: 0.2818 - binary_crossentropy: 0.2802 - auc: 0.9051 - val_loss: 0.2968 - val_binary_crossentropy: 0.2953 - val_auc: 0.9062\n",
- "Epoch 2/2\n",
- "232798/232798 [==============================] - 44s 187us/sample - loss: 0.2796 - binary_crossentropy: 0.2782 - auc: 0.9069 - val_loss: 0.2820 - val_binary_crossentropy: 0.2806 - val_auc: 0.9071\n",
- "58166/58166 [==============================] - 2s 38us/sample\n",
- "500000/500000 [==============================] - 18s 37us/sample\n",
- "Train on 232847 samples, validate on 58117 samples\n",
- "Epoch 1/2\n",
- "232847/232847 [==============================] - 43s 185us/sample - loss: 0.2786 - binary_crossentropy: 0.2773 - auc: 0.9080 - val_loss: 0.2761 - val_binary_crossentropy: 0.2749 - val_auc: 0.9113\n",
- "Epoch 2/2\n",
- "232847/232847 [==============================] - 39s 166us/sample - loss: 0.2766 - binary_crossentropy: 0.2754 - auc: 0.9097 - val_loss: 0.2872 - val_binary_crossentropy: 0.2862 - val_auc: 0.9090\n",
- "58117/58117 [==============================] - 2s 34us/sample\n",
- "500000/500000 [==============================] - 17s 33us/sample\n",
- "Train on 232716 samples, validate on 58248 samples\n",
- "Epoch 1/2\n",
- "232716/232716 [==============================] - 39s 169us/sample - loss: 0.2763 - binary_crossentropy: 0.2753 - auc: 0.9100 - val_loss: 0.2739 - val_binary_crossentropy: 0.2730 - val_auc: 0.9116\n",
- "Epoch 2/2\n",
- "232716/232716 [==============================] - 39s 168us/sample - loss: 0.2743 - binary_crossentropy: 0.2735 - auc: 0.9119 - val_loss: 0.2859 - val_binary_crossentropy: 0.2851 - val_auc: 0.9090\n",
- "58248/58248 [==============================] - 2s 35us/sample\n",
- "500000/500000 [==============================] - 17s 34us/sample\n",
- "Train on 232814 samples, validate on 58150 samples\n",
- "Epoch 1/2\n",
- "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2747 - binary_crossentropy: 0.2739 - auc: 0.9115 - val_loss: 0.2702 - val_binary_crossentropy: 0.2695 - val_auc: 0.9163\n",
- "Epoch 2/2\n",
- "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2725 - binary_crossentropy: 0.2719 - auc: 0.9132 - val_loss: 0.2751 - val_binary_crossentropy: 0.2745 - val_auc: 0.9151\n",
- "58150/58150 [==============================] - 2s 34us/sample\n",
- "500000/500000 [==============================] - 17s 34us/sample\n"
- ]
- }
- ],
- "source": [
- "# 五折交叉验证,这里的五折交叉是以用户为目标进行五折划分\n",
- "# 这一部分与前面的单独训练和验证是分开的\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_din_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
- "dense_fea = [x for x in dense_fea if x != 'label']\n",
- "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- "\n",
- "# 五折交叉验证,并将中间结果保存用于staking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
- " # 准备训练数据\n",
- " x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- " y_trn = train_idx['label'].values\n",
- "\n",
- " # 准备验证数据\n",
- " x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- " y_val = valid_idx['label'].values\n",
- " \n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val) , batch_size=256)\n",
- " \n",
- " # 预测验证集结果\n",
- " valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256) \n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 将验证集的预测结果放到一个列表中,后面进行拼接\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
- " # 如果是线上测试,需要计算每次交叉验证的结果相加,最后求平均\n",
- " if not offline:\n",
- " sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0] \n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
- "# 保存训练集交叉验证产生的新特征\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)\n",
- " \n",
- "# 测试集的预测结果,多次交叉验证求平均,将预测的score和对应的rank特征保存,可以用于后面的staking,这里还可以构造其他更多的特征\n",
- "tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_din_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
- "# 保存测试集交叉验证的新特征\n",
- "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_din_cls_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 模型融合"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 加权融合"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:27.351996Z",
- "start_time": "2020-11-18T04:44:26.561275Z"
- }
- },
- "outputs": [],
- "source": [
- "# 读取多个模型的排序结果文件\n",
- "lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')\n",
- "lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')\n",
- "din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')\n",
- "\n",
- "# 这里也可以换成交叉验证输出的测试结果进行加权融合"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:31.593981Z",
- "start_time": "2020-11-18T04:44:31.589439Z"
- }
- },
- "outputs": [],
- "source": [
- "rank_model = {'lgb_ranker': lgb_ranker, \n",
- " 'lgb_cls': lgb_cls, \n",
- " 'din_ranker': din_ranker}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:36.135860Z",
- "start_time": "2020-11-18T04:44:36.130577Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_ensumble_predict_topk(rank_model, topk=5):\n",
- " final_recall = rank_model['lgb_cls'].append(rank_model['din_ranker'])\n",
- " rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))\n",
- " \n",
- " final_recall = final_recall.append(rank_model['lgb_ranker'])\n",
- " final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()\n",
- " \n",
- " submit(final_recall, topk=topk, model_name='ensemble_fuse')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:51.659270Z",
- "start_time": "2020-11-18T04:44:40.445659Z"
- }
- },
- "outputs": [],
- "source": [
- "get_ensumble_predict_topk(rank_model)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Staking"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:58.025992Z",
- "start_time": "2020-11-18T04:44:56.146962Z"
- }
- },
- "outputs": [],
- "source": [
- "# 读取多个模型的交叉验证生成的结果文件\n",
- "# 训练集\n",
- "trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')\n",
- "trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')\n",
- "trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')\n",
- "\n",
- "# 测试集\n",
- "tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')\n",
- "tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')\n",
- "tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:07.701862Z",
- "start_time": "2020-11-18T04:45:07.644335Z"
- }
- },
- "outputs": [],
- "source": [
- "# 将多个模型输出的特征进行拼接\n",
- "\n",
- "finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']]\n",
- "finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']]\n",
- "\n",
- "for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):\n",
- " for feat in [ 'pred_score', 'pred_rank']:\n",
- " col_name = feat + '_' + str(idx)\n",
- " finall_trn_ranker_feats[col_name] = trn_model[feat]\n",
- "\n",
- "for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):\n",
- " for feat in [ 'pred_score', 'pred_rank']:\n",
- " col_name = feat + '_' + str(idx)\n",
- " finall_tst_ranker_feats[col_name] = tst_model[feat]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:15.044242Z",
- "start_time": "2020-11-18T04:45:13.138252Z"
- }
- },
- "outputs": [],
- "source": [
- "# 定义一个逻辑回归模型再次拟合交叉验证产生的特征对测试集进行预测\n",
- "# 这里需要注意的是,在做交叉验证的时候可以构造多一些与输出预测值相关的特征,来丰富这里简单模型的特征\n",
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']\n",
- "\n",
- "trn_x = finall_trn_ranker_feats[feat_cols]\n",
- "trn_y = finall_trn_ranker_feats['label']\n",
- "\n",
- "tst_x = finall_tst_ranker_feats[feat_cols]\n",
- "\n",
- "# 定义模型\n",
- "lr = LogisticRegression()\n",
- "\n",
- "# 模型训练\n",
- "lr.fit(trn_x, trn_y)\n",
- "\n",
- "# 模型预测\n",
- "finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:29.018764Z",
- "start_time": "2020-11-18T04:45:19.423130Z"
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:58.259730Z",
+ "start_time": "2020-11-18T04:22:58.254297Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the model and its parameters\n",
+ "lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:11.258774Z",
+ "start_time": "2020-11-18T04:23:00.861936Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Train the model\n",
+ "if offline:\n",
+ " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],\n",
+ " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
+ " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
+ "else:\n",
+ " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:19.591396Z",
+ "start_time": "2020-11-18T04:23:13.813850Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Predict with the model\n",
+ "tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:,1]\n",
+ "\n",
+    "# Save a copy of these ranking scores for the model ensembling later\n",
+ "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:32.352931Z",
+ "start_time": "2020-11-18T04:23:22.346609Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+    "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_cls')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:11.241196Z",
+ "start_time": "2020-11-18T04:23:41.377394Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[1]\tvalid_0's auc: 0.764896\tvalid_0's binary_logloss: 0.522153\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.767857\tvalid_0's binary_logloss: 0.52057\n",
+ "[3]\tvalid_0's auc: 0.783096\tvalid_0's binary_logloss: 0.519584\n",
+ "[4]\tvalid_0's auc: 0.784354\tvalid_0's binary_logloss: 0.518485\n",
+ "[5]\tvalid_0's auc: 0.790554\tvalid_0's binary_logloss: 0.516886\n",
+ "[6]\tvalid_0's auc: 0.791954\tvalid_0's binary_logloss: 0.515334\n",
+ "[7]\tvalid_0's auc: 0.794257\tvalid_0's binary_logloss: 0.514032\n",
+ "[8]\tvalid_0's auc: 0.795222\tvalid_0's binary_logloss: 0.512516\n",
+ "[9]\tvalid_0's auc: 0.795417\tvalid_0's binary_logloss: 0.511671\n",
+ "[10]\tvalid_0's auc: 0.795913\tvalid_0's binary_logloss: 0.510226\n",
+ "[11]\tvalid_0's auc: 0.798222\tvalid_0's binary_logloss: 0.508858\n",
+ "[12]\tvalid_0's auc: 0.79825\tvalid_0's binary_logloss: 0.507928\n",
+ "[13]\tvalid_0's auc: 0.798842\tvalid_0's binary_logloss: 0.50708\n",
+ "[14]\tvalid_0's auc: 0.798935\tvalid_0's binary_logloss: 0.505752\n",
+ "[15]\tvalid_0's auc: 0.799543\tvalid_0's binary_logloss: 0.504388\n",
+ "[16]\tvalid_0's auc: 0.800844\tvalid_0's binary_logloss: 0.503126\n",
+ "[17]\tvalid_0's auc: 0.800855\tvalid_0's binary_logloss: 0.501809\n",
+ "[18]\tvalid_0's auc: 0.801653\tvalid_0's binary_logloss: 0.500676\n",
+ "[19]\tvalid_0's auc: 0.801518\tvalid_0's binary_logloss: 0.49987\n",
+ "[20]\tvalid_0's auc: 0.801662\tvalid_0's binary_logloss: 0.498625\n",
+ "[21]\tvalid_0's auc: 0.802093\tvalid_0's binary_logloss: 0.498113\n",
+ "[22]\tvalid_0's auc: 0.803071\tvalid_0's binary_logloss: 0.496933\n",
+ "[23]\tvalid_0's auc: 0.803222\tvalid_0's binary_logloss: 0.495864\n",
+ "[24]\tvalid_0's auc: 0.802927\tvalid_0's binary_logloss: 0.494691\n",
+ "[25]\tvalid_0's auc: 0.802581\tvalid_0's binary_logloss: 0.493543\n",
+ "[26]\tvalid_0's auc: 0.802965\tvalid_0's binary_logloss: 0.492444\n",
+ "[27]\tvalid_0's auc: 0.80298\tvalid_0's binary_logloss: 0.491336\n",
+ "[28]\tvalid_0's auc: 0.803226\tvalid_0's binary_logloss: 0.490275\n",
+ "[29]\tvalid_0's auc: 0.803436\tvalid_0's binary_logloss: 0.489126\n",
+ "[30]\tvalid_0's auc: 0.803796\tvalid_0's binary_logloss: 0.48802\n",
+ "[31]\tvalid_0's auc: 0.803601\tvalid_0's binary_logloss: 0.486988\n",
+ "[32]\tvalid_0's auc: 0.804416\tvalid_0's binary_logloss: 0.485972\n",
+ "[33]\tvalid_0's auc: 0.804529\tvalid_0's binary_logloss: 0.484939\n",
+ "[34]\tvalid_0's auc: 0.804534\tvalid_0's binary_logloss: 0.483927\n",
+ "[35]\tvalid_0's auc: 0.804819\tvalid_0's binary_logloss: 0.483271\n",
+ "[36]\tvalid_0's auc: 0.804774\tvalid_0's binary_logloss: 0.482273\n",
+ "[37]\tvalid_0's auc: 0.805237\tvalid_0's binary_logloss: 0.481639\n",
+ "[38]\tvalid_0's auc: 0.805546\tvalid_0's binary_logloss: 0.480959\n",
+ "[39]\tvalid_0's auc: 0.805598\tvalid_0's binary_logloss: 0.479955\n",
+ "[40]\tvalid_0's auc: 0.806011\tvalid_0's binary_logloss: 0.47903\n",
+ "[41]\tvalid_0's auc: 0.806664\tvalid_0's binary_logloss: 0.478439\n",
+ "[42]\tvalid_0's auc: 0.807021\tvalid_0's binary_logloss: 0.477798\n",
+ "[43]\tvalid_0's auc: 0.80726\tvalid_0's binary_logloss: 0.476829\n",
+ "[44]\tvalid_0's auc: 0.807157\tvalid_0's binary_logloss: 0.475976\n",
+ "[45]\tvalid_0's auc: 0.807788\tvalid_0's binary_logloss: 0.475056\n",
+ "[46]\tvalid_0's auc: 0.80805\tvalid_0's binary_logloss: 0.474446\n",
+ "[47]\tvalid_0's auc: 0.808097\tvalid_0's binary_logloss: 0.473576\n",
+ "[48]\tvalid_0's auc: 0.80815\tvalid_0's binary_logloss: 0.472676\n",
+ "[49]\tvalid_0's auc: 0.808304\tvalid_0's binary_logloss: 0.471918\n",
+ "[50]\tvalid_0's auc: 0.808749\tvalid_0's binary_logloss: 0.471481\n",
+ "[51]\tvalid_0's auc: 0.808972\tvalid_0's binary_logloss: 0.471104\n",
+ "[52]\tvalid_0's auc: 0.809326\tvalid_0's binary_logloss: 0.470289\n",
+ "[53]\tvalid_0's auc: 0.809472\tvalid_0's binary_logloss: 0.469508\n",
+ "[54]\tvalid_0's auc: 0.809505\tvalid_0's binary_logloss: 0.46869\n",
+ "[55]\tvalid_0's auc: 0.809594\tvalid_0's binary_logloss: 0.467885\n",
+ "[56]\tvalid_0's auc: 0.809847\tvalid_0's binary_logloss: 0.467356\n",
+ "[57]\tvalid_0's auc: 0.810262\tvalid_0's binary_logloss: 0.466531\n",
+ "[58]\tvalid_0's auc: 0.810407\tvalid_0's binary_logloss: 0.46573\n",
+ "[59]\tvalid_0's auc: 0.810618\tvalid_0's binary_logloss: 0.465205\n",
+ "[60]\tvalid_0's auc: 0.81066\tvalid_0's binary_logloss: 0.464435\n",
+ "[61]\tvalid_0's auc: 0.810638\tvalid_0's binary_logloss: 0.463721\n",
+ "[62]\tvalid_0's auc: 0.810658\tvalid_0's binary_logloss: 0.462982\n",
+ "[63]\tvalid_0's auc: 0.811106\tvalid_0's binary_logloss: 0.462246\n",
+ "[64]\tvalid_0's auc: 0.811313\tvalid_0's binary_logloss: 0.461748\n",
+ "[65]\tvalid_0's auc: 0.811351\tvalid_0's binary_logloss: 0.461038\n",
+ "[66]\tvalid_0's auc: 0.811433\tvalid_0's binary_logloss: 0.460323\n",
+ "[67]\tvalid_0's auc: 0.81158\tvalid_0's binary_logloss: 0.459662\n",
+ "[68]\tvalid_0's auc: 0.811561\tvalid_0's binary_logloss: 0.458988\n",
+ "[69]\tvalid_0's auc: 0.811748\tvalid_0's binary_logloss: 0.458592\n",
+ "[70]\tvalid_0's auc: 0.811919\tvalid_0's binary_logloss: 0.457934\n",
+ "[71]\tvalid_0's auc: 0.812073\tvalid_0's binary_logloss: 0.457508\n",
+ "[72]\tvalid_0's auc: 0.812273\tvalid_0's binary_logloss: 0.457038\n",
+ "[73]\tvalid_0's auc: 0.812561\tvalid_0's binary_logloss: 0.456439\n",
+ "[74]\tvalid_0's auc: 0.812633\tvalid_0's binary_logloss: 0.455789\n",
+ "[75]\tvalid_0's auc: 0.812757\tvalid_0's binary_logloss: 0.455173\n",
+ "[76]\tvalid_0's auc: 0.812923\tvalid_0's binary_logloss: 0.454533\n",
+ "[77]\tvalid_0's auc: 0.81295\tvalid_0's binary_logloss: 0.45392\n",
+ "[78]\tvalid_0's auc: 0.813073\tvalid_0's binary_logloss: 0.453517\n",
+ "[79]\tvalid_0's auc: 0.813202\tvalid_0's binary_logloss: 0.452932\n",
+ "[80]\tvalid_0's auc: 0.813611\tvalid_0's binary_logloss: 0.452285\n",
+ "[81]\tvalid_0's auc: 0.813769\tvalid_0's binary_logloss: 0.45191\n",
+ "[82]\tvalid_0's auc: 0.814468\tvalid_0's binary_logloss: 0.451455\n",
+ "[83]\tvalid_0's auc: 0.814656\tvalid_0's binary_logloss: 0.450885\n",
+ "[84]\tvalid_0's auc: 0.814755\tvalid_0's binary_logloss: 0.450308\n",
+ "[85]\tvalid_0's auc: 0.814824\tvalid_0's binary_logloss: 0.449739\n",
+ "[86]\tvalid_0's auc: 0.81499\tvalid_0's binary_logloss: 0.449348\n",
+ "[87]\tvalid_0's auc: 0.815232\tvalid_0's binary_logloss: 0.448759\n",
+ "[88]\tvalid_0's auc: 0.815452\tvalid_0's binary_logloss: 0.44823\n",
+ "[89]\tvalid_0's auc: 0.815593\tvalid_0's binary_logloss: 0.447861\n",
+ "[90]\tvalid_0's auc: 0.815591\tvalid_0's binary_logloss: 0.447323\n",
+ "[91]\tvalid_0's auc: 0.815672\tvalid_0's binary_logloss: 0.446796\n",
+ "[92]\tvalid_0's auc: 0.815875\tvalid_0's binary_logloss: 0.446472\n",
+ "[93]\tvalid_0's auc: 0.815984\tvalid_0's binary_logloss: 0.445961\n",
+ "[94]\tvalid_0's auc: 0.816026\tvalid_0's binary_logloss: 0.445439\n",
+ "[95]\tvalid_0's auc: 0.816172\tvalid_0's binary_logloss: 0.444909\n",
+ "[96]\tvalid_0's auc: 0.816321\tvalid_0's binary_logloss: 0.444413\n",
+ "[97]\tvalid_0's auc: 0.816751\tvalid_0's binary_logloss: 0.44405\n",
+ "[98]\tvalid_0's auc: 0.817226\tvalid_0's binary_logloss: 0.443626\n",
+ "[99]\tvalid_0's auc: 0.817286\tvalid_0's binary_logloss: 0.443136\n",
+ "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
+ "[1]\tvalid_0's auc: 0.771584\tvalid_0's binary_logloss: 0.527139\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.775446\tvalid_0's binary_logloss: 0.525462\n",
+ "[3]\tvalid_0's auc: 0.790092\tvalid_0's binary_logloss: 0.524461\n",
+ "[4]\tvalid_0's auc: 0.791432\tvalid_0's binary_logloss: 0.523322\n",
+ "[5]\tvalid_0's auc: 0.797482\tvalid_0's binary_logloss: 0.521614\n",
+ "[6]\tvalid_0's auc: 0.79893\tvalid_0's binary_logloss: 0.520007\n",
+ "[7]\tvalid_0's auc: 0.800753\tvalid_0's binary_logloss: 0.5187\n",
+ "[8]\tvalid_0's auc: 0.802197\tvalid_0's binary_logloss: 0.517125\n",
+ "[9]\tvalid_0's auc: 0.802828\tvalid_0's binary_logloss: 0.516269\n",
+ "[10]\tvalid_0's auc: 0.803496\tvalid_0's binary_logloss: 0.51474\n",
+ "[11]\tvalid_0's auc: 0.804972\tvalid_0's binary_logloss: 0.513321\n",
+ "[12]\tvalid_0's auc: 0.804995\tvalid_0's binary_logloss: 0.512334\n",
+ "[13]\tvalid_0's auc: 0.80525\tvalid_0's binary_logloss: 0.51151\n",
+ "[14]\tvalid_0's auc: 0.805026\tvalid_0's binary_logloss: 0.510149\n",
+ "[15]\tvalid_0's auc: 0.805622\tvalid_0's binary_logloss: 0.508708\n",
+ "[16]\tvalid_0's auc: 0.806974\tvalid_0's binary_logloss: 0.507384\n",
+ "[17]\tvalid_0's auc: 0.807045\tvalid_0's binary_logloss: 0.506017\n",
+ "[18]\tvalid_0's auc: 0.807265\tvalid_0's binary_logloss: 0.504853\n",
+ "[19]\tvalid_0's auc: 0.807126\tvalid_0's binary_logloss: 0.503972\n",
+ "[20]\tvalid_0's auc: 0.806948\tvalid_0's binary_logloss: 0.502693\n",
+ "[21]\tvalid_0's auc: 0.807315\tvalid_0's binary_logloss: 0.502166\n",
+ "[22]\tvalid_0's auc: 0.808067\tvalid_0's binary_logloss: 0.500948\n",
+ "[23]\tvalid_0's auc: 0.808226\tvalid_0's binary_logloss: 0.49987\n",
+ "[24]\tvalid_0's auc: 0.808268\tvalid_0's binary_logloss: 0.498623\n",
+ "[25]\tvalid_0's auc: 0.808569\tvalid_0's binary_logloss: 0.497389\n",
+ "[26]\tvalid_0's auc: 0.809069\tvalid_0's binary_logloss: 0.49624\n",
+ "[27]\tvalid_0's auc: 0.809312\tvalid_0's binary_logloss: 0.495095\n",
+ "[28]\tvalid_0's auc: 0.809549\tvalid_0's binary_logloss: 0.494012\n",
+ "[29]\tvalid_0's auc: 0.809944\tvalid_0's binary_logloss: 0.492834\n",
+ "[30]\tvalid_0's auc: 0.810047\tvalid_0's binary_logloss: 0.491735\n",
+ "[31]\tvalid_0's auc: 0.810086\tvalid_0's binary_logloss: 0.490633\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[32]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.489595\n",
+ "[33]\tvalid_0's auc: 0.810539\tvalid_0's binary_logloss: 0.488536\n",
+ "[34]\tvalid_0's auc: 0.810529\tvalid_0's binary_logloss: 0.487489\n",
+ "[35]\tvalid_0's auc: 0.810932\tvalid_0's binary_logloss: 0.486775\n",
+ "[36]\tvalid_0's auc: 0.810769\tvalid_0's binary_logloss: 0.48577\n",
+ "[37]\tvalid_0's auc: 0.811363\tvalid_0's binary_logloss: 0.485123\n",
+ "[38]\tvalid_0's auc: 0.811801\tvalid_0's binary_logloss: 0.484413\n",
+ "[39]\tvalid_0's auc: 0.811987\tvalid_0's binary_logloss: 0.483371\n",
+ "[40]\tvalid_0's auc: 0.812268\tvalid_0's binary_logloss: 0.482407\n",
+ "[41]\tvalid_0's auc: 0.813297\tvalid_0's binary_logloss: 0.481742\n",
+ "[42]\tvalid_0's auc: 0.813453\tvalid_0's binary_logloss: 0.481108\n",
+ "[43]\tvalid_0's auc: 0.813603\tvalid_0's binary_logloss: 0.480163\n",
+ "[44]\tvalid_0's auc: 0.813654\tvalid_0's binary_logloss: 0.479239\n",
+ "[45]\tvalid_0's auc: 0.814267\tvalid_0's binary_logloss: 0.478299\n",
+ "[46]\tvalid_0's auc: 0.81455\tvalid_0's binary_logloss: 0.477678\n",
+ "[47]\tvalid_0's auc: 0.81452\tvalid_0's binary_logloss: 0.476766\n",
+ "[48]\tvalid_0's auc: 0.814925\tvalid_0's binary_logloss: 0.475815\n",
+ "[49]\tvalid_0's auc: 0.814907\tvalid_0's binary_logloss: 0.47503\n",
+ "[50]\tvalid_0's auc: 0.815278\tvalid_0's binary_logloss: 0.474588\n",
+ "[51]\tvalid_0's auc: 0.815535\tvalid_0's binary_logloss: 0.474171\n",
+ "[52]\tvalid_0's auc: 0.815685\tvalid_0's binary_logloss: 0.473335\n",
+ "[53]\tvalid_0's auc: 0.815787\tvalid_0's binary_logloss: 0.472509\n",
+ "[54]\tvalid_0's auc: 0.815827\tvalid_0's binary_logloss: 0.471686\n",
+ "[55]\tvalid_0's auc: 0.815871\tvalid_0's binary_logloss: 0.470838\n",
+ "[56]\tvalid_0's auc: 0.816238\tvalid_0's binary_logloss: 0.470285\n",
+ "[57]\tvalid_0's auc: 0.816269\tvalid_0's binary_logloss: 0.469495\n",
+ "[58]\tvalid_0's auc: 0.816528\tvalid_0's binary_logloss: 0.468654\n",
+ "[59]\tvalid_0's auc: 0.816706\tvalid_0's binary_logloss: 0.468122\n",
+ "[60]\tvalid_0's auc: 0.816821\tvalid_0's binary_logloss: 0.467352\n",
+ "[61]\tvalid_0's auc: 0.816759\tvalid_0's binary_logloss: 0.466622\n",
+ "[62]\tvalid_0's auc: 0.81682\tvalid_0's binary_logloss: 0.465867\n",
+ "[63]\tvalid_0's auc: 0.817251\tvalid_0's binary_logloss: 0.465112\n",
+ "[64]\tvalid_0's auc: 0.817476\tvalid_0's binary_logloss: 0.464589\n",
+ "[65]\tvalid_0's auc: 0.817613\tvalid_0's binary_logloss: 0.463831\n",
+ "[66]\tvalid_0's auc: 0.817648\tvalid_0's binary_logloss: 0.463098\n",
+ "[67]\tvalid_0's auc: 0.817719\tvalid_0's binary_logloss: 0.462414\n",
+ "[68]\tvalid_0's auc: 0.817814\tvalid_0's binary_logloss: 0.461727\n",
+ "[69]\tvalid_0's auc: 0.817973\tvalid_0's binary_logloss: 0.461329\n",
+ "[70]\tvalid_0's auc: 0.818108\tvalid_0's binary_logloss: 0.460674\n",
+ "[71]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.460222\n",
+ "[72]\tvalid_0's auc: 0.818456\tvalid_0's binary_logloss: 0.45977\n",
+ "[73]\tvalid_0's auc: 0.818727\tvalid_0's binary_logloss: 0.459157\n",
+ "[74]\tvalid_0's auc: 0.818988\tvalid_0's binary_logloss: 0.458437\n",
+ "[75]\tvalid_0's auc: 0.819144\tvalid_0's binary_logloss: 0.457808\n",
+ "[76]\tvalid_0's auc: 0.819259\tvalid_0's binary_logloss: 0.457159\n",
+ "[77]\tvalid_0's auc: 0.819343\tvalid_0's binary_logloss: 0.456512\n",
+ "[78]\tvalid_0's auc: 0.81954\tvalid_0's binary_logloss: 0.456045\n",
+ "[79]\tvalid_0's auc: 0.819687\tvalid_0's binary_logloss: 0.455416\n",
+ "[80]\tvalid_0's auc: 0.819958\tvalid_0's binary_logloss: 0.454765\n",
+ "[81]\tvalid_0's auc: 0.820115\tvalid_0's binary_logloss: 0.45436\n",
+ "[82]\tvalid_0's auc: 0.820536\tvalid_0's binary_logloss: 0.453965\n",
+ "[83]\tvalid_0's auc: 0.820649\tvalid_0's binary_logloss: 0.453383\n",
+ "[84]\tvalid_0's auc: 0.820663\tvalid_0's binary_logloss: 0.452804\n",
+ "[85]\tvalid_0's auc: 0.820809\tvalid_0's binary_logloss: 0.452167\n",
+ "[86]\tvalid_0's auc: 0.821024\tvalid_0's binary_logloss: 0.451735\n",
+ "[87]\tvalid_0's auc: 0.821124\tvalid_0's binary_logloss: 0.451167\n",
+ "[88]\tvalid_0's auc: 0.821243\tvalid_0's binary_logloss: 0.45061\n",
+ "[89]\tvalid_0's auc: 0.821404\tvalid_0's binary_logloss: 0.450215\n",
+ "[90]\tvalid_0's auc: 0.821488\tvalid_0's binary_logloss: 0.449656\n",
+ "[91]\tvalid_0's auc: 0.821538\tvalid_0's binary_logloss: 0.449107\n",
+ "[92]\tvalid_0's auc: 0.82172\tvalid_0's binary_logloss: 0.448752\n",
+ "[93]\tvalid_0's auc: 0.821809\tvalid_0's binary_logloss: 0.448188\n",
+ "[94]\tvalid_0's auc: 0.82184\tvalid_0's binary_logloss: 0.447659\n",
+ "[95]\tvalid_0's auc: 0.821971\tvalid_0's binary_logloss: 0.447108\n",
+ "[96]\tvalid_0's auc: 0.822086\tvalid_0's binary_logloss: 0.446596\n",
+ "[97]\tvalid_0's auc: 0.82247\tvalid_0's binary_logloss: 0.446244\n",
+ "[98]\tvalid_0's auc: 0.822951\tvalid_0's binary_logloss: 0.445812\n",
+ "[99]\tvalid_0's auc: 0.822991\tvalid_0's binary_logloss: 0.445329\n",
+ "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
+ "[1]\tvalid_0's auc: 0.769525\tvalid_0's binary_logloss: 0.526256\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.775857\tvalid_0's binary_logloss: 0.524594\n",
+ "[3]\tvalid_0's auc: 0.785307\tvalid_0's binary_logloss: 0.523606\n",
+ "[4]\tvalid_0's auc: 0.786356\tvalid_0's binary_logloss: 0.522495\n",
+ "[5]\tvalid_0's auc: 0.793385\tvalid_0's binary_logloss: 0.520812\n",
+ "[6]\tvalid_0's auc: 0.794014\tvalid_0's binary_logloss: 0.519253\n",
+ "[7]\tvalid_0's auc: 0.795454\tvalid_0's binary_logloss: 0.517961\n",
+ "[8]\tvalid_0's auc: 0.79807\tvalid_0's binary_logloss: 0.516363\n",
+ "[9]\tvalid_0's auc: 0.798756\tvalid_0's binary_logloss: 0.51548\n",
+ "[10]\tvalid_0's auc: 0.798314\tvalid_0's binary_logloss: 0.514021\n",
+ "[11]\tvalid_0's auc: 0.799343\tvalid_0's binary_logloss: 0.512678\n",
+ "[12]\tvalid_0's auc: 0.799573\tvalid_0's binary_logloss: 0.511708\n",
+ "[13]\tvalid_0's auc: 0.799563\tvalid_0's binary_logloss: 0.510892\n",
+ "[14]\tvalid_0's auc: 0.800333\tvalid_0's binary_logloss: 0.509532\n",
+ "[15]\tvalid_0's auc: 0.800672\tvalid_0's binary_logloss: 0.508117\n",
+ "[16]\tvalid_0's auc: 0.801953\tvalid_0's binary_logloss: 0.506866\n",
+ "[17]\tvalid_0's auc: 0.802078\tvalid_0's binary_logloss: 0.5055\n",
+ "[18]\tvalid_0's auc: 0.802449\tvalid_0's binary_logloss: 0.504358\n",
+ "[19]\tvalid_0's auc: 0.802329\tvalid_0's binary_logloss: 0.503503\n",
+ "[20]\tvalid_0's auc: 0.802437\tvalid_0's binary_logloss: 0.502233\n",
+ "[21]\tvalid_0's auc: 0.802653\tvalid_0's binary_logloss: 0.50174\n",
+ "[22]\tvalid_0's auc: 0.803753\tvalid_0's binary_logloss: 0.50056\n",
+ "[23]\tvalid_0's auc: 0.803956\tvalid_0's binary_logloss: 0.499496\n",
+ "[24]\tvalid_0's auc: 0.804231\tvalid_0's binary_logloss: 0.498283\n",
+ "[25]\tvalid_0's auc: 0.804554\tvalid_0's binary_logloss: 0.497059\n",
+ "[26]\tvalid_0's auc: 0.805133\tvalid_0's binary_logloss: 0.495963\n",
+ "[27]\tvalid_0's auc: 0.805333\tvalid_0's binary_logloss: 0.494842\n",
+ "[28]\tvalid_0's auc: 0.805644\tvalid_0's binary_logloss: 0.493771\n",
+ "[29]\tvalid_0's auc: 0.806029\tvalid_0's binary_logloss: 0.492598\n",
+ "[30]\tvalid_0's auc: 0.806321\tvalid_0's binary_logloss: 0.491474\n",
+ "[31]\tvalid_0's auc: 0.806201\tvalid_0's binary_logloss: 0.490419\n",
+ "[32]\tvalid_0's auc: 0.806671\tvalid_0's binary_logloss: 0.489393\n",
+ "[33]\tvalid_0's auc: 0.806899\tvalid_0's binary_logloss: 0.488331\n",
+ "[34]\tvalid_0's auc: 0.807105\tvalid_0's binary_logloss: 0.487277\n",
+ "[35]\tvalid_0's auc: 0.807257\tvalid_0's binary_logloss: 0.486592\n",
+ "[36]\tvalid_0's auc: 0.80729\tvalid_0's binary_logloss: 0.485607\n",
+ "[37]\tvalid_0's auc: 0.807752\tvalid_0's binary_logloss: 0.484951\n",
+ "[38]\tvalid_0's auc: 0.808191\tvalid_0's binary_logloss: 0.484269\n",
+ "[39]\tvalid_0's auc: 0.808417\tvalid_0's binary_logloss: 0.483242\n",
+ "[40]\tvalid_0's auc: 0.808761\tvalid_0's binary_logloss: 0.482291\n",
+ "[41]\tvalid_0's auc: 0.80965\tvalid_0's binary_logloss: 0.48164\n",
+ "[42]\tvalid_0's auc: 0.810065\tvalid_0's binary_logloss: 0.480962\n",
+ "[43]\tvalid_0's auc: 0.810209\tvalid_0's binary_logloss: 0.479995\n",
+ "[44]\tvalid_0's auc: 0.810091\tvalid_0's binary_logloss: 0.479077\n",
+ "[45]\tvalid_0's auc: 0.810573\tvalid_0's binary_logloss: 0.478185\n",
+ "[46]\tvalid_0's auc: 0.810924\tvalid_0's binary_logloss: 0.477558\n",
+ "[47]\tvalid_0's auc: 0.810951\tvalid_0's binary_logloss: 0.476662\n",
+ "[48]\tvalid_0's auc: 0.811101\tvalid_0's binary_logloss: 0.475745\n",
+ "[49]\tvalid_0's auc: 0.811269\tvalid_0's binary_logloss: 0.474951\n",
+ "[50]\tvalid_0's auc: 0.81173\tvalid_0's binary_logloss: 0.474514\n",
+ "[51]\tvalid_0's auc: 0.811937\tvalid_0's binary_logloss: 0.474114\n",
+ "[52]\tvalid_0's auc: 0.812136\tvalid_0's binary_logloss: 0.473297\n",
+ "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.472497\n",
+ "[54]\tvalid_0's auc: 0.812121\tvalid_0's binary_logloss: 0.471696\n",
+ "[55]\tvalid_0's auc: 0.812164\tvalid_0's binary_logloss: 0.470905\n",
+ "[56]\tvalid_0's auc: 0.812462\tvalid_0's binary_logloss: 0.470384\n",
+ "[57]\tvalid_0's auc: 0.812613\tvalid_0's binary_logloss: 0.4696\n",
+ "[58]\tvalid_0's auc: 0.812615\tvalid_0's binary_logloss: 0.468778\n",
+ "[59]\tvalid_0's auc: 0.812842\tvalid_0's binary_logloss: 0.468211\n",
+ "[60]\tvalid_0's auc: 0.81312\tvalid_0's binary_logloss: 0.467385\n",
+ "[61]\tvalid_0's auc: 0.813039\tvalid_0's binary_logloss: 0.466632\n",
+ "[62]\tvalid_0's auc: 0.812942\tvalid_0's binary_logloss: 0.465933\n",
+ "[63]\tvalid_0's auc: 0.813274\tvalid_0's binary_logloss: 0.465214\n",
+ "[64]\tvalid_0's auc: 0.813572\tvalid_0's binary_logloss: 0.464692\n",
+ "[65]\tvalid_0's auc: 0.813594\tvalid_0's binary_logloss: 0.463925\n",
+ "[66]\tvalid_0's auc: 0.813719\tvalid_0's binary_logloss: 0.463177\n",
+ "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.462513\n",
+ "[68]\tvalid_0's auc: 0.813989\tvalid_0's binary_logloss: 0.461843\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[69]\tvalid_0's auc: 0.814218\tvalid_0's binary_logloss: 0.461443\n",
+ "[70]\tvalid_0's auc: 0.814334\tvalid_0's binary_logloss: 0.460775\n",
+ "[71]\tvalid_0's auc: 0.814493\tvalid_0's binary_logloss: 0.460332\n",
+ "[72]\tvalid_0's auc: 0.814663\tvalid_0's binary_logloss: 0.459867\n",
+ "[73]\tvalid_0's auc: 0.814856\tvalid_0's binary_logloss: 0.459266\n",
+ "[74]\tvalid_0's auc: 0.815017\tvalid_0's binary_logloss: 0.458585\n",
+ "[75]\tvalid_0's auc: 0.815186\tvalid_0's binary_logloss: 0.457958\n",
+ "[76]\tvalid_0's auc: 0.815374\tvalid_0's binary_logloss: 0.457316\n",
+ "[77]\tvalid_0's auc: 0.81554\tvalid_0's binary_logloss: 0.45665\n",
+ "[78]\tvalid_0's auc: 0.81569\tvalid_0's binary_logloss: 0.456217\n",
+ "[79]\tvalid_0's auc: 0.815861\tvalid_0's binary_logloss: 0.455615\n",
+ "[80]\tvalid_0's auc: 0.816443\tvalid_0's binary_logloss: 0.454895\n",
+ "[81]\tvalid_0's auc: 0.816659\tvalid_0's binary_logloss: 0.454503\n",
+ "[82]\tvalid_0's auc: 0.817017\tvalid_0's binary_logloss: 0.454149\n",
+ "[83]\tvalid_0's auc: 0.817162\tvalid_0's binary_logloss: 0.453578\n",
+ "[84]\tvalid_0's auc: 0.817274\tvalid_0's binary_logloss: 0.452984\n",
+ "[85]\tvalid_0's auc: 0.817283\tvalid_0's binary_logloss: 0.452416\n",
+ "[86]\tvalid_0's auc: 0.817339\tvalid_0's binary_logloss: 0.452022\n",
+ "[87]\tvalid_0's auc: 0.817494\tvalid_0's binary_logloss: 0.45146\n",
+ "[88]\tvalid_0's auc: 0.817594\tvalid_0's binary_logloss: 0.450926\n",
+ "[89]\tvalid_0's auc: 0.817771\tvalid_0's binary_logloss: 0.450553\n",
+ "[90]\tvalid_0's auc: 0.81789\tvalid_0's binary_logloss: 0.449985\n",
+ "[91]\tvalid_0's auc: 0.817931\tvalid_0's binary_logloss: 0.449439\n",
+ "[92]\tvalid_0's auc: 0.818138\tvalid_0's binary_logloss: 0.449094\n",
+ "[93]\tvalid_0's auc: 0.818334\tvalid_0's binary_logloss: 0.448527\n",
+ "[94]\tvalid_0's auc: 0.818426\tvalid_0's binary_logloss: 0.447989\n",
+ "[95]\tvalid_0's auc: 0.818676\tvalid_0's binary_logloss: 0.447407\n",
+ "[96]\tvalid_0's auc: 0.818852\tvalid_0's binary_logloss: 0.446884\n",
+ "[97]\tvalid_0's auc: 0.81945\tvalid_0's binary_logloss: 0.446455\n",
+ "[98]\tvalid_0's auc: 0.819861\tvalid_0's binary_logloss: 0.446045\n",
+ "[99]\tvalid_0's auc: 0.819943\tvalid_0's binary_logloss: 0.445543\n",
+ "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
+ "[1]\tvalid_0's auc: 0.770032\tvalid_0's binary_logloss: 0.527241\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.779881\tvalid_0's binary_logloss: 0.525545\n",
+ "[3]\tvalid_0's auc: 0.791308\tvalid_0's binary_logloss: 0.524508\n",
+ "[4]\tvalid_0's auc: 0.790788\tvalid_0's binary_logloss: 0.52341\n",
+ "[5]\tvalid_0's auc: 0.795645\tvalid_0's binary_logloss: 0.521753\n",
+ "[6]\tvalid_0's auc: 0.797745\tvalid_0's binary_logloss: 0.520131\n",
+ "[7]\tvalid_0's auc: 0.79931\tvalid_0's binary_logloss: 0.518872\n",
+ "[8]\tvalid_0's auc: 0.800014\tvalid_0's binary_logloss: 0.517353\n",
+ "[9]\tvalid_0's auc: 0.800549\tvalid_0's binary_logloss: 0.516487\n",
+ "[10]\tvalid_0's auc: 0.800261\tvalid_0's binary_logloss: 0.515039\n",
+ "[11]\tvalid_0's auc: 0.801261\tvalid_0's binary_logloss: 0.513695\n",
+ "[12]\tvalid_0's auc: 0.801062\tvalid_0's binary_logloss: 0.512735\n",
+ "[13]\tvalid_0's auc: 0.801155\tvalid_0's binary_logloss: 0.51192\n",
+ "[14]\tvalid_0's auc: 0.801315\tvalid_0's binary_logloss: 0.510559\n",
+ "[15]\tvalid_0's auc: 0.80185\tvalid_0's binary_logloss: 0.509147\n",
+ "[16]\tvalid_0's auc: 0.803029\tvalid_0's binary_logloss: 0.507914\n",
+ "[17]\tvalid_0's auc: 0.803035\tvalid_0's binary_logloss: 0.506583\n",
+ "[18]\tvalid_0's auc: 0.803433\tvalid_0's binary_logloss: 0.505441\n",
+ "[19]\tvalid_0's auc: 0.803717\tvalid_0's binary_logloss: 0.504599\n",
+ "[20]\tvalid_0's auc: 0.803819\tvalid_0's binary_logloss: 0.503327\n",
+ "[21]\tvalid_0's auc: 0.803923\tvalid_0's binary_logloss: 0.502782\n",
+ "[22]\tvalid_0's auc: 0.804939\tvalid_0's binary_logloss: 0.501596\n",
+ "[23]\tvalid_0's auc: 0.804707\tvalid_0's binary_logloss: 0.500572\n",
+ "[24]\tvalid_0's auc: 0.804632\tvalid_0's binary_logloss: 0.499367\n",
+ "[25]\tvalid_0's auc: 0.804756\tvalid_0's binary_logloss: 0.498161\n",
+ "[26]\tvalid_0's auc: 0.805067\tvalid_0's binary_logloss: 0.497061\n",
+ "[27]\tvalid_0's auc: 0.805119\tvalid_0's binary_logloss: 0.495933\n",
+ "[28]\tvalid_0's auc: 0.805304\tvalid_0's binary_logloss: 0.494849\n",
+ "[29]\tvalid_0's auc: 0.805688\tvalid_0's binary_logloss: 0.493677\n",
+ "[30]\tvalid_0's auc: 0.805822\tvalid_0's binary_logloss: 0.492594\n",
+ "[31]\tvalid_0's auc: 0.805869\tvalid_0's binary_logloss: 0.49152\n",
+ "[32]\tvalid_0's auc: 0.807267\tvalid_0's binary_logloss: 0.490435\n",
+ "[33]\tvalid_0's auc: 0.807301\tvalid_0's binary_logloss: 0.489392\n",
+ "[34]\tvalid_0's auc: 0.80736\tvalid_0's binary_logloss: 0.488325\n",
+ "[35]\tvalid_0's auc: 0.807706\tvalid_0's binary_logloss: 0.487654\n",
+ "[36]\tvalid_0's auc: 0.807758\tvalid_0's binary_logloss: 0.486651\n",
+ "[37]\tvalid_0's auc: 0.808051\tvalid_0's binary_logloss: 0.486012\n",
+ "[38]\tvalid_0's auc: 0.808429\tvalid_0's binary_logloss: 0.485355\n",
+ "[39]\tvalid_0's auc: 0.808663\tvalid_0's binary_logloss: 0.484327\n",
+ "[40]\tvalid_0's auc: 0.809007\tvalid_0's binary_logloss: 0.483386\n",
+ "[41]\tvalid_0's auc: 0.809781\tvalid_0's binary_logloss: 0.482745\n",
+ "[42]\tvalid_0's auc: 0.810071\tvalid_0's binary_logloss: 0.482124\n",
+ "[43]\tvalid_0's auc: 0.810383\tvalid_0's binary_logloss: 0.481154\n",
+ "[44]\tvalid_0's auc: 0.810446\tvalid_0's binary_logloss: 0.480243\n",
+ "[45]\tvalid_0's auc: 0.811148\tvalid_0's binary_logloss: 0.479261\n",
+ "[46]\tvalid_0's auc: 0.811245\tvalid_0's binary_logloss: 0.478687\n",
+ "[47]\tvalid_0's auc: 0.811214\tvalid_0's binary_logloss: 0.477812\n",
+ "[48]\tvalid_0's auc: 0.811408\tvalid_0's binary_logloss: 0.47689\n",
+ "[49]\tvalid_0's auc: 0.811486\tvalid_0's binary_logloss: 0.476132\n",
+ "[50]\tvalid_0's auc: 0.811806\tvalid_0's binary_logloss: 0.475718\n",
+ "[51]\tvalid_0's auc: 0.812017\tvalid_0's binary_logloss: 0.475342\n",
+ "[52]\tvalid_0's auc: 0.812255\tvalid_0's binary_logloss: 0.474505\n",
+ "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.473707\n",
+ "[54]\tvalid_0's auc: 0.812235\tvalid_0's binary_logloss: 0.47289\n",
+ "[55]\tvalid_0's auc: 0.812233\tvalid_0's binary_logloss: 0.472091\n",
+ "[56]\tvalid_0's auc: 0.812492\tvalid_0's binary_logloss: 0.471563\n",
+ "[57]\tvalid_0's auc: 0.812579\tvalid_0's binary_logloss: 0.47077\n",
+ "[58]\tvalid_0's auc: 0.812598\tvalid_0's binary_logloss: 0.469992\n",
+ "[59]\tvalid_0's auc: 0.812885\tvalid_0's binary_logloss: 0.469458\n",
+ "[60]\tvalid_0's auc: 0.812995\tvalid_0's binary_logloss: 0.468676\n",
+ "[61]\tvalid_0's auc: 0.812961\tvalid_0's binary_logloss: 0.467939\n",
+ "[62]\tvalid_0's auc: 0.812919\tvalid_0's binary_logloss: 0.467232\n",
+ "[63]\tvalid_0's auc: 0.813291\tvalid_0's binary_logloss: 0.466491\n",
+ "[64]\tvalid_0's auc: 0.813702\tvalid_0's binary_logloss: 0.465945\n",
+ "[65]\tvalid_0's auc: 0.813803\tvalid_0's binary_logloss: 0.465197\n",
+ "[66]\tvalid_0's auc: 0.813851\tvalid_0's binary_logloss: 0.4645\n",
+ "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.463814\n",
+ "[68]\tvalid_0's auc: 0.814027\tvalid_0's binary_logloss: 0.463113\n",
+ "[69]\tvalid_0's auc: 0.814138\tvalid_0's binary_logloss: 0.462727\n",
+ "[70]\tvalid_0's auc: 0.814365\tvalid_0's binary_logloss: 0.462077\n",
+ "[71]\tvalid_0's auc: 0.814432\tvalid_0's binary_logloss: 0.461655\n",
+ "[72]\tvalid_0's auc: 0.8146\tvalid_0's binary_logloss: 0.461194\n",
+ "[73]\tvalid_0's auc: 0.815324\tvalid_0's binary_logloss: 0.460477\n",
+ "[74]\tvalid_0's auc: 0.815411\tvalid_0's binary_logloss: 0.459805\n",
+ "[75]\tvalid_0's auc: 0.815548\tvalid_0's binary_logloss: 0.459189\n",
+ "[76]\tvalid_0's auc: 0.815625\tvalid_0's binary_logloss: 0.458525\n",
+ "[77]\tvalid_0's auc: 0.81562\tvalid_0's binary_logloss: 0.457905\n",
+ "[78]\tvalid_0's auc: 0.815786\tvalid_0's binary_logloss: 0.45747\n",
+ "[79]\tvalid_0's auc: 0.815834\tvalid_0's binary_logloss: 0.456884\n",
+ "[80]\tvalid_0's auc: 0.816475\tvalid_0's binary_logloss: 0.45617\n",
+ "[81]\tvalid_0's auc: 0.816677\tvalid_0's binary_logloss: 0.455787\n",
+ "[82]\tvalid_0's auc: 0.817255\tvalid_0's binary_logloss: 0.455358\n",
+ "[83]\tvalid_0's auc: 0.817383\tvalid_0's binary_logloss: 0.454775\n",
+ "[84]\tvalid_0's auc: 0.817509\tvalid_0's binary_logloss: 0.454176\n",
+ "[85]\tvalid_0's auc: 0.817572\tvalid_0's binary_logloss: 0.453609\n",
+ "[86]\tvalid_0's auc: 0.817721\tvalid_0's binary_logloss: 0.453213\n",
+ "[87]\tvalid_0's auc: 0.817992\tvalid_0's binary_logloss: 0.452586\n",
+ "[88]\tvalid_0's auc: 0.81808\tvalid_0's binary_logloss: 0.45204\n",
+ "[89]\tvalid_0's auc: 0.818202\tvalid_0's binary_logloss: 0.451643\n",
+ "[90]\tvalid_0's auc: 0.818336\tvalid_0's binary_logloss: 0.451081\n",
+ "[91]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.450531\n",
+ "[92]\tvalid_0's auc: 0.818558\tvalid_0's binary_logloss: 0.450179\n",
+ "[93]\tvalid_0's auc: 0.818743\tvalid_0's binary_logloss: 0.449647\n",
+ "[94]\tvalid_0's auc: 0.818789\tvalid_0's binary_logloss: 0.449133\n",
+ "[95]\tvalid_0's auc: 0.818849\tvalid_0's binary_logloss: 0.44862\n",
+ "[96]\tvalid_0's auc: 0.81913\tvalid_0's binary_logloss: 0.448072\n",
+ "[97]\tvalid_0's auc: 0.819526\tvalid_0's binary_logloss: 0.447713\n",
+ "[98]\tvalid_0's auc: 0.819971\tvalid_0's binary_logloss: 0.447296\n",
+ "[99]\tvalid_0's auc: 0.819972\tvalid_0's binary_logloss: 0.446814\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
+ "[1]\tvalid_0's auc: 0.768646\tvalid_0's binary_logloss: 0.527167\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.779902\tvalid_0's binary_logloss: 0.525481\n",
+ "[3]\tvalid_0's auc: 0.789868\tvalid_0's binary_logloss: 0.524485\n",
+ "[4]\tvalid_0's auc: 0.791895\tvalid_0's binary_logloss: 0.523382\n",
+ "[5]\tvalid_0's auc: 0.795453\tvalid_0's binary_logloss: 0.521759\n",
+ "[6]\tvalid_0's auc: 0.796672\tvalid_0's binary_logloss: 0.520166\n",
+ "[7]\tvalid_0's auc: 0.798023\tvalid_0's binary_logloss: 0.518857\n",
+ "[8]\tvalid_0's auc: 0.799331\tvalid_0's binary_logloss: 0.517297\n",
+ "[9]\tvalid_0's auc: 0.800181\tvalid_0's binary_logloss: 0.516416\n",
+ "[10]\tvalid_0's auc: 0.800373\tvalid_0's binary_logloss: 0.514967\n",
+ "[11]\tvalid_0's auc: 0.801087\tvalid_0's binary_logloss: 0.513631\n",
+ "[12]\tvalid_0's auc: 0.801122\tvalid_0's binary_logloss: 0.512658\n",
+ "[13]\tvalid_0's auc: 0.801043\tvalid_0's binary_logloss: 0.511833\n",
+ "[14]\tvalid_0's auc: 0.801238\tvalid_0's binary_logloss: 0.510461\n",
+ "[15]\tvalid_0's auc: 0.801847\tvalid_0's binary_logloss: 0.509034\n",
+ "[16]\tvalid_0's auc: 0.803139\tvalid_0's binary_logloss: 0.507759\n",
+ "[17]\tvalid_0's auc: 0.803577\tvalid_0's binary_logloss: 0.506361\n",
+ "[18]\tvalid_0's auc: 0.803834\tvalid_0's binary_logloss: 0.505229\n",
+ "[19]\tvalid_0's auc: 0.803943\tvalid_0's binary_logloss: 0.504371\n",
+ "[20]\tvalid_0's auc: 0.80415\tvalid_0's binary_logloss: 0.503102\n",
+ "[21]\tvalid_0's auc: 0.804446\tvalid_0's binary_logloss: 0.502564\n",
+ "[22]\tvalid_0's auc: 0.805163\tvalid_0's binary_logloss: 0.501396\n",
+ "[23]\tvalid_0's auc: 0.805323\tvalid_0's binary_logloss: 0.500327\n",
+ "[24]\tvalid_0's auc: 0.805314\tvalid_0's binary_logloss: 0.499123\n",
+ "[25]\tvalid_0's auc: 0.80535\tvalid_0's binary_logloss: 0.497927\n",
+ "[26]\tvalid_0's auc: 0.805864\tvalid_0's binary_logloss: 0.496834\n",
+ "[27]\tvalid_0's auc: 0.805919\tvalid_0's binary_logloss: 0.495667\n",
+ "[28]\tvalid_0's auc: 0.806272\tvalid_0's binary_logloss: 0.494606\n",
+ "[29]\tvalid_0's auc: 0.806599\tvalid_0's binary_logloss: 0.49343\n",
+ "[30]\tvalid_0's auc: 0.806932\tvalid_0's binary_logloss: 0.492303\n",
+ "[31]\tvalid_0's auc: 0.806656\tvalid_0's binary_logloss: 0.491249\n",
+ "[32]\tvalid_0's auc: 0.807436\tvalid_0's binary_logloss: 0.490188\n",
+ "[33]\tvalid_0's auc: 0.807629\tvalid_0's binary_logloss: 0.489117\n",
+ "[34]\tvalid_0's auc: 0.807501\tvalid_0's binary_logloss: 0.48808\n",
+ "[35]\tvalid_0's auc: 0.807885\tvalid_0's binary_logloss: 0.487383\n",
+ "[36]\tvalid_0's auc: 0.807921\tvalid_0's binary_logloss: 0.48636\n",
+ "[37]\tvalid_0's auc: 0.808267\tvalid_0's binary_logloss: 0.485724\n",
+ "[38]\tvalid_0's auc: 0.808563\tvalid_0's binary_logloss: 0.485076\n",
+ "[39]\tvalid_0's auc: 0.808813\tvalid_0's binary_logloss: 0.484039\n",
+ "[40]\tvalid_0's auc: 0.809023\tvalid_0's binary_logloss: 0.483091\n",
+ "[41]\tvalid_0's auc: 0.809782\tvalid_0's binary_logloss: 0.482441\n",
+ "[42]\tvalid_0's auc: 0.810135\tvalid_0's binary_logloss: 0.48179\n",
+ "[43]\tvalid_0's auc: 0.810219\tvalid_0's binary_logloss: 0.48082\n",
+ "[44]\tvalid_0's auc: 0.81031\tvalid_0's binary_logloss: 0.479906\n",
+ "[45]\tvalid_0's auc: 0.810514\tvalid_0's binary_logloss: 0.479024\n",
+ "[46]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.478437\n",
+ "[47]\tvalid_0's auc: 0.810611\tvalid_0's binary_logloss: 0.477529\n",
+ "[48]\tvalid_0's auc: 0.810781\tvalid_0's binary_logloss: 0.476637\n",
+ "[49]\tvalid_0's auc: 0.81089\tvalid_0's binary_logloss: 0.475883\n",
+ "[50]\tvalid_0's auc: 0.811266\tvalid_0's binary_logloss: 0.475459\n",
+ "[51]\tvalid_0's auc: 0.811402\tvalid_0's binary_logloss: 0.475078\n",
+ "[52]\tvalid_0's auc: 0.811765\tvalid_0's binary_logloss: 0.474246\n",
+ "[53]\tvalid_0's auc: 0.811891\tvalid_0's binary_logloss: 0.473452\n",
+ "[54]\tvalid_0's auc: 0.811868\tvalid_0's binary_logloss: 0.47263\n",
+ "[55]\tvalid_0's auc: 0.81192\tvalid_0's binary_logloss: 0.471804\n",
+ "[56]\tvalid_0's auc: 0.812272\tvalid_0's binary_logloss: 0.471275\n",
+ "[57]\tvalid_0's auc: 0.812639\tvalid_0's binary_logloss: 0.470396\n",
+ "[58]\tvalid_0's auc: 0.812764\tvalid_0's binary_logloss: 0.469597\n",
+ "[59]\tvalid_0's auc: 0.813084\tvalid_0's binary_logloss: 0.469049\n",
+ "[60]\tvalid_0's auc: 0.813342\tvalid_0's binary_logloss: 0.468244\n",
+ "[61]\tvalid_0's auc: 0.813302\tvalid_0's binary_logloss: 0.467499\n",
+ "[62]\tvalid_0's auc: 0.813221\tvalid_0's binary_logloss: 0.466758\n",
+ "[63]\tvalid_0's auc: 0.813697\tvalid_0's binary_logloss: 0.466017\n",
+ "[64]\tvalid_0's auc: 0.813985\tvalid_0's binary_logloss: 0.465501\n",
+ "[65]\tvalid_0's auc: 0.81416\tvalid_0's binary_logloss: 0.464725\n",
+ "[66]\tvalid_0's auc: 0.814227\tvalid_0's binary_logloss: 0.46398\n",
+ "[67]\tvalid_0's auc: 0.814397\tvalid_0's binary_logloss: 0.463309\n",
+ "[68]\tvalid_0's auc: 0.814426\tvalid_0's binary_logloss: 0.462627\n",
+ "[69]\tvalid_0's auc: 0.814593\tvalid_0's binary_logloss: 0.462244\n",
+ "[70]\tvalid_0's auc: 0.814789\tvalid_0's binary_logloss: 0.461571\n",
+ "[71]\tvalid_0's auc: 0.814889\tvalid_0's binary_logloss: 0.461144\n",
+ "[72]\tvalid_0's auc: 0.815078\tvalid_0's binary_logloss: 0.460684\n",
+ "[73]\tvalid_0's auc: 0.815439\tvalid_0's binary_logloss: 0.460063\n",
+ "[74]\tvalid_0's auc: 0.815511\tvalid_0's binary_logloss: 0.459386\n",
+ "[75]\tvalid_0's auc: 0.815574\tvalid_0's binary_logloss: 0.45877\n",
+ "[76]\tvalid_0's auc: 0.815634\tvalid_0's binary_logloss: 0.458128\n",
+ "[77]\tvalid_0's auc: 0.815618\tvalid_0's binary_logloss: 0.457495\n",
+ "[78]\tvalid_0's auc: 0.81582\tvalid_0's binary_logloss: 0.457057\n",
+ "[79]\tvalid_0's auc: 0.81594\tvalid_0's binary_logloss: 0.456475\n",
+ "[80]\tvalid_0's auc: 0.815961\tvalid_0's binary_logloss: 0.455885\n",
+ "[81]\tvalid_0's auc: 0.816153\tvalid_0's binary_logloss: 0.455511\n",
+ "[82]\tvalid_0's auc: 0.816433\tvalid_0's binary_logloss: 0.455186\n",
+ "[83]\tvalid_0's auc: 0.816546\tvalid_0's binary_logloss: 0.454625\n",
+ "[84]\tvalid_0's auc: 0.816586\tvalid_0's binary_logloss: 0.454039\n",
+ "[85]\tvalid_0's auc: 0.816584\tvalid_0's binary_logloss: 0.453482\n",
+ "[86]\tvalid_0's auc: 0.816881\tvalid_0's binary_logloss: 0.453048\n",
+ "[87]\tvalid_0's auc: 0.817029\tvalid_0's binary_logloss: 0.452485\n",
+ "[88]\tvalid_0's auc: 0.81707\tvalid_0's binary_logloss: 0.451941\n",
+ "[89]\tvalid_0's auc: 0.817298\tvalid_0's binary_logloss: 0.451544\n",
+ "[90]\tvalid_0's auc: 0.817343\tvalid_0's binary_logloss: 0.450975\n",
+ "[91]\tvalid_0's auc: 0.817357\tvalid_0's binary_logloss: 0.450422\n",
+ "[92]\tvalid_0's auc: 0.817592\tvalid_0's binary_logloss: 0.450109\n",
+ "[93]\tvalid_0's auc: 0.817729\tvalid_0's binary_logloss: 0.449542\n",
+ "[94]\tvalid_0's auc: 0.817834\tvalid_0's binary_logloss: 0.448982\n",
+ "[95]\tvalid_0's auc: 0.81809\tvalid_0's binary_logloss: 0.448398\n",
+ "[96]\tvalid_0's auc: 0.818269\tvalid_0's binary_logloss: 0.447908\n",
+ "[97]\tvalid_0's auc: 0.818682\tvalid_0's binary_logloss: 0.447547\n",
+ "[98]\tvalid_0's auc: 0.819015\tvalid_0's binary_logloss: 0.447165\n",
+ "[99]\tvalid_0's auc: 0.819016\tvalid_0's binary_logloss: 0.446669\n",
+ "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n"
+ ]
+ }
+ ],
+ "source": [
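+    "# 5-fold cross-validation; the folds are split by user, so each user's samples fall into exactly one fold\n",
+    "# This part is independent of the single train/validation run above\n",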
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_rank_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
+ "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
+ "\n",
+    "# 5-fold cross-validation; save the intermediate results for stacking\n",
+ "for n_fold, valid_user in enumerate(user_set):\n",
+ " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
+    "    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()  # copy so new columns can be added without SettingWithCopyWarning\n",
+ " \n",
+    "    # Define the model and its parameters\n",
+ " lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) \n",
+    "    # Train the model\n",
+ " lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'],eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], \n",
+ " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
+ " \n",
+    "    # Predict on the validation fold\n",
+ " valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols], \n",
+ " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
+ " \n",
+    "    # No normalization here: the classifier already outputs a probability\n",
+    "    # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
+    "    \n",
+    "    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])  # sort_values is not in-place, so keep the result\n",
+ " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+    "    # Collect each fold's validation predictions in a list; they are concatenated later\n",
+ " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+ " \n",
+    "    # For online testing, accumulate the test predictions of every fold; they are averaged afterwards\n",
+ " if not offline:\n",
+ " sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols], \n",
+ " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new features produced by cross-validation on the training set\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for later stacking; more features could be constructed here\n",
+    "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
+    "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "tst_user_item_feats_df_rank_model = tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])  # sort_values is not in-place, so keep the result\n",
+ "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new cross-validation features for the test set\n",
+ "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:23.074237Z",
+ "start_time": "2020-11-18T04:24:13.812284Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+    "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_cls')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## DIN model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Users' historical click lists\n",
+    "These lists are built as input for the DIN model below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:30.508213Z",
+ "start_time": "2020-11-18T04:24:27.426372Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "if offline:\n",
+ " all_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
+ "else:\n",
+ " trn_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
+ " tst_data = pd.read_csv('./data_raw/testA_click_log.csv')\n",
+    "    all_data = pd.concat([trn_data, tst_data])  # DataFrame.append was removed in pandas 2.0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:25:28.082071Z",
+ "start_time": "2020-11-18T04:24:33.649524Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "hist_click = all_data[['user_id', 'click_article_id']].groupby('user_id').agg(list).reset_index()\n",
+ "his_behavior_df = pd.DataFrame()\n",
+ "his_behavior_df['user_id'] = hist_click['user_id']\n",
+ "his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:25:52.925866Z",
+ "start_time": "2020-11-18T04:25:52.863922Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_din_model = val_user_item_feats_df.copy()\n",
+ "else: \n",
+ " val_user_item_feats_df_din_model = None\n",
+ " \n",
+ "tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:00.070681Z",
+ "start_time": "2020-11-18T04:25:56.417197Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
+ "else:\n",
+ " val_user_item_feats_df_din_model = None\n",
+ "\n",
+ "tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
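+    "### Introduction to the DIN model\n",
+    "Below we try the DIN model. DIN stands for Deep Interest Network, a model proposed by Alibaba in 2018 because earlier deep learning models could not express the diversity of a user's interests. It computes a representation vector of the user's interest from the relevance between the [given candidate ad] and the [user's historical behaviors]. Concretely, it introduces a local activation unit that soft-searches the relevant parts of the behavior history and uses a weighted sum to obtain the user-interest representation with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the user's interest. Because this representation vector differs across ads, the expressive power of the model improves considerably. The model is therefore also a good fit for this news recommendation task: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user clicked in the past. The model structure is shown below:\n",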
+ "\n",
+ "![image-20201116201646983](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)\n",
+ "\n",
+ "\n",
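+    "Here we simply call the packaged implementation; the details of the model will be covered in the next round of the recommender-system team-learning course. The following explains how to use it. The deepctr function prototype is:\n",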
+ "> def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,\n",
+ "> dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation=\"dice\",\n",
+ "> att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,\n",
+ "> task='binary'):\n",
+ "> \n",
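+    "> * dnn_feature_columns: feature columns, a list containing all features of the data\n",
+    "> * history_feature_list: user history behavior columns, a list of the features that reflect the user's historical behavior\n",
+    "> * dnn_use_bn: whether to use BatchNormalization\n",
+    "> * dnn_hidden_units: number of layers and number of units per layer of the fully connected network, given as a list or tuple\n",
+    "> * dnn_activation: activation type of the fully connected network\n",
+    "> * att_hidden_size: number of layers and number of units per layer of the fully connected network inside the attention layer\n",
+    "> * att_activation: activation type of the attention layer\n",
+    "> * att_weight_normalization: whether to normalize the attention scores\n",
+    "> * l2_reg_dnn: L2 regularization coefficient of the fully connected network\n",
+    "> * l2_reg_embedding: L2 regularization coefficient of the embedding vectors\n",
+    "> * dnn_dropout: dropout probability of the units in the fully connected network\n",
+    "> * task: the task type, either classification or regression\n",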
+ "\n",
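+    "When using the model we must pass in the feature columns and the history behavior columns, but before doing so the feature columns need some preprocessing, as follows:\n",
+    "\n",
+    "1. First we process the dataset to obtain the data. Since we predict whether the user clicks the current article based on the user's past behavior, we split the feature columns into three parts: numeric features, discrete features, and history behavior features. DIN handles each part differently.\n",
+    "    1. Discrete features are the categorical features in our dataset, such as user_id. Such a feature first goes through an embedding to obtain a low-dimensional dense representation, so we need to build a dictionary over the values of each categorical column and specify the embedding dimension. When preparing data for deepctr's DIN model we therefore declare these categorical features with SparseFeat, whose arguments are the column name, the number of unique values in the column (used to build the dictionary), and the embedding dimension.\n",
+    "    2. User history behavior columns, such as article ids and article categories, also go through an embedding first. The difference is that, after obtaining the embedding of each item, we additionally pass them through an attention layer that computes the relevance between the user's historical behaviors and the current candidate article, which yields the embedding vector of the current user. This vector reflects the user's interest through the similarity between the candidate article and the articles the user clicked in the past, and it changes with the user's click history, dynamically modeling how the user's interest evolves. For each user this kind of feature is a behavior sequence, and sequence lengths differ across users: some users have clicked many articles, others only a few. We therefore need to unify the length. When preparing data for DIN we first declare these features with SparseFeat and then wrap them in VarLenSparseFeat, whose maxlen argument specifies the maximum sequence length; the sequences themselves are padded to that length so that every user's history has the same length.\n",
+    "    3. For continuous feature columns, we only need DenseFeat to specify the column name and the dimension.\n",
+    "2. After processing the feature columns, we match the data to the columns, which gives the final data.\n",
+    "\n",
+    "Let's get a feel for this through the code. The logic is: first write a data preparation function that follows the steps above and returns the data and the feature columns, then build and train the DIN model, and finally run the model on the test set."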
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:08.405211Z",
+ "start_time": "2020-11-18T04:26:04.887013Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Import deepctr\n",
+ "from deepctr.models import DIN\n",
+ "from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names\n",
+ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
+ "\n",
+ "from tensorflow.keras import backend as K\n",
+ "from tensorflow.keras.layers import *\n",
+ "from tensorflow.keras.models import *\n",
+ "from tensorflow.keras.callbacks import * \n",
+ "import tensorflow as tf\n",
+ "\n",
+ "import os\n",
+ "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
+ "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:13.485712Z",
+ "start_time": "2020-11-18T04:26:13.476042Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Data preparation function\n",
+    "def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, hist_behavior_fea, emb_dim=32, max_len=100):\n",
+    "    \"\"\"\n",
+    "    Data preparation function:\n",
+    "    df: dataset\n",
+    "    dense_fea: numeric feature columns\n",
+    "    sparse_fea: discrete feature columns\n",
+    "    behavior_fea: candidate behavior feature columns of the user\n",
+    "    hist_behavior_fea: history behavior feature columns of the user\n",
+    "    emb_dim: embedding dimension; for simplicity all discrete feature columns share the same dimension\n",
+    "    max_len: maximum length of the user sequence\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    # vocabulary_size = nunique() + 1 assumes the ids are label-encoded into the range [0, nunique()]\n",
+    "    sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]\n",
+    "    \n",
+    "    dense_feature_columns = [DenseFeat(feat, 1, ) for feat in dense_fea]\n",
+    "    \n",
+    "    var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,\n",
+    "                                    embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in hist_behavior_fea]\n",
+    "    \n",
+    "    dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns\n",
+    "    \n",
+    "    # Build x, a dict mapping feature name to its input array\n",
+    "    x = {}\n",
+    "    for name in get_feature_names(dnn_feature_columns):\n",
+    "        if name in hist_behavior_fea:\n",
+    "            # This is a history behavior sequence\n",
+    "            his_list = [l for l in df[name]]\n",
+    "            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')      # 2-D array\n",
+    "        else:\n",
+    "            x[name] = df[name].values\n",
+    "    \n",
+    "    return x, dnn_feature_columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:18.783217Z",
+ "start_time": "2020-11-18T04:26:18.776795Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Split the features into groups\n",
+ "sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup', \n",
+ " 'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']\n",
+ "\n",
+ "behavior_fea = ['click_article_id']\n",
+ "\n",
+ "hist_behavior_fea = ['hist_click_article_id']\n",
+ "\n",
+ "dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',\n",
+ " 'rank','click_size','time_diff_mean','active_level','user_time_hob1','user_time_hob2',\n",
+ " 'words_hbo','words_count']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:25.469810Z",
+ "start_time": "2020-11-18T04:26:24.779347Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Normalize the dense features; neural network training requires normalized numeric inputs\n",
+    "mm = MinMaxScaler()\n",
+    "\n",
+    "# Special handling: if invalid values such as inf were produced elsewhere, normalization fails unless they are handled.\n",
+    "# You can leave the two lines below commented out at first; if the code below raises an error, first work out how to avoid producing inf values upstream\n",
+    "# trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
+    "# tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
+    "\n",
+    "for feat in dense_fea:\n",
+    "    # Fit the scaler on the training set only, then reuse the same scaling for validation and test\n",
+    "    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])\n",
+    "    \n",
+    "    if val_user_item_feats_df_din_model is not None:\n",
+    "        val_user_item_feats_df_din_model[feat] = mm.transform(val_user_item_feats_df_din_model[[feat]])\n",
+    "    \n",
+    "    tst_user_item_feats_df_din_model[feat] = mm.transform(tst_user_item_feats_df_din_model[[feat]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:36.727753Z",
+ "start_time": "2020-11-18T04:26:28.854705Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Prepare the training data\n",
+ "x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ "y_trn = trn_user_item_feats_df_din_model['label'].values\n",
+ "\n",
+ "if offline:\n",
+    "    # Prepare the validation data\n",
+ " x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_val = val_user_item_feats_df_din_model['label'].values\n",
+ " \n",
+ "dense_fea = [x for x in dense_fea if x != 'label']\n",
+ "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:45.146318Z",
+ "start_time": "2020-11-18T04:26:40.423914Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:255: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
+ "Model: \"model\"\n",
+ "__________________________________________________________________________________________________\n",
+ "Layer (type) Output Shape Param # Connected to \n",
+ "==================================================================================================\n",
+ "user_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_article_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "category_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_environment (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_deviceGroup (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_os (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_country (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_region (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_referrer_type (InputLayer [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "is_cat_hab (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_user_id (Embedding) (None, 1, 32) 1600032 user_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_seq_emb_hist_click_artic multiple 525664 click_article_id[0][0] \n",
+ " hist_click_article_id[0][0] \n",
+ " click_article_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_category_id (Embeddi (None, 1, 32) 7776 category_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_environment (E (None, 1, 32) 128 click_environment[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_deviceGroup (E (None, 1, 32) 160 click_deviceGroup[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_os (Embedding) (None, 1, 32) 288 click_os[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 64 is_cat_hab[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0] \n",
+ " sparse_seq_emb_hist_click_article\n",
+ " sparse_emb_category_id[0][0] \n",
+ " sparse_emb_click_environment[0][0\n",
+ " sparse_emb_click_deviceGroup[0][0\n",
+ " sparse_emb_click_os[0][0] \n",
+ " sparse_emb_click_country[0][0] \n",
+ " sparse_emb_click_region[0][0] \n",
+ " sparse_emb_click_referrer_type[0]\n",
+ " sparse_emb_is_cat_hab[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "hist_click_article_id (InputLay [(None, 50)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0] \n",
+ " no_mask[1][0] \n",
+ " no_mask[2][0] \n",
+ " no_mask[3][0] \n",
+ " no_mask[4][0] \n",
+ " no_mask[5][0] \n",
+ " no_mask[6][0] \n",
+ " no_mask[7][0] \n",
+ " no_mask[8][0] \n",
+ " no_mask[9][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article\n",
+ " sparse_seq_emb_hist_click_article\n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0] \n",
+ " attention_sequence_pooling_layer[\n",
+ "__________________________________________________________________________________________________\n",
+ "sim0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "time_diff0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "word_diff0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_max (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_min (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_sum (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_mean (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "score (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "rank (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_size (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "time_diff_mean (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "active_level (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "user_time_hob1 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "user_time_hob2 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "words_hbo (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "words_count (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten (Flatten) (None, 352) 0 concatenate_1[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_3 (NoMask) (None, 1) 0 sim0[0][0] \n",
+ " time_diff0[0][0] \n",
+ " word_diff0[0][0] \n",
+ " sim_max[0][0] \n",
+ " sim_min[0][0] \n",
+ " sim_sum[0][0] \n",
+ " sim_mean[0][0] \n",
+ " score[0][0] \n",
+ " rank[0][0] \n",
+ " click_size[0][0] \n",
+ " time_diff_mean[0][0] \n",
+ " active_level[0][0] \n",
+ " user_time_hob1[0][0] \n",
+ " user_time_hob2[0][0] \n",
+ " words_hbo[0][0] \n",
+ " words_count[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_2 (NoMask) (None, 352) 0 flatten[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0] \n",
+ " no_mask_3[1][0] \n",
+ " no_mask_3[2][0] \n",
+ " no_mask_3[3][0] \n",
+ " no_mask_3[4][0] \n",
+ " no_mask_3[5][0] \n",
+ " no_mask_3[6][0] \n",
+ " no_mask_3[7][0] \n",
+ " no_mask_3[8][0] \n",
+ " no_mask_3[9][0] \n",
+ " no_mask_3[10][0] \n",
+ " no_mask_3[11][0] \n",
+ " no_mask_3[12][0] \n",
+ " no_mask_3[13][0] \n",
+ " no_mask_3[14][0] \n",
+ " no_mask_3[15][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_4 (NoMask) multiple 0 flatten_1[0][0] \n",
+ " flatten_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0] \n",
+ " no_mask_4[1][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "dnn_1 (DNN) (None, 80) 89880 concatenate_3[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "dense (Dense) (None, 1) 80 dnn_1[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "prediction_layer (PredictionLay (None, 1) 1 dense[0][0] \n",
+ "==================================================================================================\n",
+ "Total params: 2,239,602\n",
+ "Trainable params: 2,239,362\n",
+ "Non-trainable params: 240\n",
+ "__________________________________________________________________________________________________\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Build the model\n",
+    "model = DIN(dnn_feature_columns, behavior_fea)\n",
+    "\n",
+    "# Inspect the model structure\n",
+    "model.summary()\n",
+    "\n",
+    "# Compile the model\n",
+    "model.compile('adam', 'binary_crossentropy',metrics=['binary_crossentropy', tf.keras.metrics.AUC()])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:28:43.885773Z",
+ "start_time": "2020-11-18T04:26:48.746787Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Epoch 1/2\n",
+ "290964/290964 [==============================] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842\n",
+ "Epoch 2/2\n",
+ "290964/290964 [==============================] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Train the model\n",
+    "if offline:\n",
+    "    history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val) , batch_size=256)\n",
+    "else:\n",
+    "    # You can also use the statement above with a validation set you sampled yourself, or hold out a split as in the commented line below\n",
+    "    # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)\n",
+    "    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:29:20.436591Z",
+ "start_time": "2020-11-18T04:28:58.102057Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "500000/500000 [==============================] - 20s 39us/sample\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Predict with the model\n",
+ "tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)\n",
+ "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:29:34.985535Z",
+ "start_time": "2020-11-18T04:29:26.264531Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]\n",
+ "submit(rank_results, topk=5, model_name='din')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-15T06:15:49.490705Z",
+ "start_time": "2020-11-15T06:15:49.473794Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:38:53.760383Z",
+ "start_time": "2020-11-18T04:29:51.737721Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Train on 232681 samples, validate on 58283 samples\n",
+ "Epoch 1/2\n",
+ "232681/232681 [==============================] - 44s 189us/sample - loss: 0.2864 - binary_crossentropy: 0.2846 - auc: 0.9008 - val_loss: 0.2830 - val_binary_crossentropy: 0.2813 - val_auc: 0.9072\n",
+ "Epoch 2/2\n",
+ "232681/232681 [==============================] - 44s 187us/sample - loss: 0.2832 - binary_crossentropy: 0.2816 - auc: 0.9034 - val_loss: 0.2846 - val_binary_crossentropy: 0.2830 - val_auc: 0.9053\n",
+ "58283/58283 [==============================] - 2s 36us/sample\n",
+ "500000/500000 [==============================] - 19s 37us/sample\n",
+ "Train on 232798 samples, validate on 58166 samples\n",
+ "Epoch 1/2\n",
+ "232798/232798 [==============================] - 43s 184us/sample - loss: 0.2818 - binary_crossentropy: 0.2802 - auc: 0.9051 - val_loss: 0.2968 - val_binary_crossentropy: 0.2953 - val_auc: 0.9062\n",
+ "Epoch 2/2\n",
+ "232798/232798 [==============================] - 44s 187us/sample - loss: 0.2796 - binary_crossentropy: 0.2782 - auc: 0.9069 - val_loss: 0.2820 - val_binary_crossentropy: 0.2806 - val_auc: 0.9071\n",
+ "58166/58166 [==============================] - 2s 38us/sample\n",
+ "500000/500000 [==============================] - 18s 37us/sample\n",
+ "Train on 232847 samples, validate on 58117 samples\n",
+ "Epoch 1/2\n",
+ "232847/232847 [==============================] - 43s 185us/sample - loss: 0.2786 - binary_crossentropy: 0.2773 - auc: 0.9080 - val_loss: 0.2761 - val_binary_crossentropy: 0.2749 - val_auc: 0.9113\n",
+ "Epoch 2/2\n",
+ "232847/232847 [==============================] - 39s 166us/sample - loss: 0.2766 - binary_crossentropy: 0.2754 - auc: 0.9097 - val_loss: 0.2872 - val_binary_crossentropy: 0.2862 - val_auc: 0.9090\n",
+ "58117/58117 [==============================] - 2s 34us/sample\n",
+ "500000/500000 [==============================] - 17s 33us/sample\n",
+ "Train on 232716 samples, validate on 58248 samples\n",
+ "Epoch 1/2\n",
+ "232716/232716 [==============================] - 39s 169us/sample - loss: 0.2763 - binary_crossentropy: 0.2753 - auc: 0.9100 - val_loss: 0.2739 - val_binary_crossentropy: 0.2730 - val_auc: 0.9116\n",
+ "Epoch 2/2\n",
+ "232716/232716 [==============================] - 39s 168us/sample - loss: 0.2743 - binary_crossentropy: 0.2735 - auc: 0.9119 - val_loss: 0.2859 - val_binary_crossentropy: 0.2851 - val_auc: 0.9090\n",
+ "58248/58248 [==============================] - 2s 35us/sample\n",
+ "500000/500000 [==============================] - 17s 34us/sample\n",
+ "Train on 232814 samples, validate on 58150 samples\n",
+ "Epoch 1/2\n",
+ "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2747 - binary_crossentropy: 0.2739 - auc: 0.9115 - val_loss: 0.2702 - val_binary_crossentropy: 0.2695 - val_auc: 0.9163\n",
+ "Epoch 2/2\n",
+ "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2725 - binary_crossentropy: 0.2719 - auc: 0.9132 - val_loss: 0.2751 - val_binary_crossentropy: 0.2745 - val_auc: 0.9151\n",
+ "58150/58150 [==============================] - 2s 34us/sample\n",
+ "500000/500000 [==============================] - 17s 34us/sample\n"
+ ]
+ }
+ ],
+ "source": [
+    "# 5-fold cross-validation; the folds are split by user\n",
+    "# This part is independent of the single train/validation run above\n",
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_din_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
+    "sub_preds = np.zeros(tst_user_item_feats_df_din_model.shape[0])\n",
+ "\n",
+ "dense_fea = [x for x in dense_fea if x != 'label']\n",
+ "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ "\n",
+    "# 5-fold cross-validation; intermediate results are saved for stacking\n",
+    "for n_fold, valid_user in enumerate(user_set):\n",
+    "    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
+    "    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()  # copy so that new columns can be added safely\n",
+    "    \n",
+    "    # Prepare the training data\n",
+ " x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_trn = train_idx['label'].values\n",
+ "\n",
+    "    # Prepare the validation data\n",
+ " x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_val = valid_idx['label'].values\n",
+ " \n",
+    "    # Note: `model` is the instance built and trained above; it is not re-initialized here, so its weights carry over between folds\n",
+    "    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val) , batch_size=256)\n",
+    "    \n",
+    "    # Predict on the validation fold\n",
+    "    valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256)   \n",
+    "    \n",
+    "    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
+    "    valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+    "    \n",
+    "    # Collect the validation predictions in a list; they are concatenated afterwards\n",
+    "    score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+    "    \n",
+    "    # For online submission, accumulate the test predictions of every fold; they are averaged afterwards\n",
+    "    if not offline:\n",
+    "        sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0]   \n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new training-set features produced by cross-validation\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the score and its rank as features for later stacking; more features could be constructed here\n",
+    "tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold\n",
+    "tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.sort_values(by=['user_id', 'pred_score'])\n",
+ "tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new test-set features produced by cross-validation\n",
+ "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_din_cls_feats.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "# Model ensembling"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Weighted fusion"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:27.351996Z",
+ "start_time": "2020-11-18T04:44:26.561275Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Read the ranking result files of the individual models\n",
+ "lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')\n",
+ "lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')\n",
+ "din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')\n",
+ "\n",
+    "# The test results output by cross-validation can also be used here for the weighted fusion"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:31.593981Z",
+ "start_time": "2020-11-18T04:44:31.589439Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "rank_model = {'lgb_ranker': lgb_ranker, \n",
+ " 'lgb_cls': lgb_cls, \n",
+ " 'din_ranker': din_ranker}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:36.135860Z",
+ "start_time": "2020-11-18T04:44:36.130577Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_ensumble_predict_topk(rank_model, topk=5):\n",
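+ " # Normalise the LGB ranker scores with norm_sim before combining them with the other models' scores\n",
+ " rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))\n",
+ " \n",
+ " # pd.concat replaces DataFrame.append, which was removed in pandas 2.0\n",
+ " final_recall = pd.concat([rank_model['lgb_cls'], rank_model['din_ranker'], rank_model['lgb_ranker']])\n",
+ " # Sum the scores of each (user, article) pair across the models; every model currently has weight 1\n",
+ " final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()\n",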
+ " \n",
+ " submit(final_recall, topk=topk, model_name='ensemble_fuse')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:51.659270Z",
+ "start_time": "2020-11-18T04:44:40.445659Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "get_ensumble_predict_topk(rank_model)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Stacking"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:58.025992Z",
+ "start_time": "2020-11-18T04:44:56.146962Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Read the result files generated by each model's cross-validation\n",
+ "# Training set\n",
+ "trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')\n",
+ "trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')\n",
+ "trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')\n",
+ "\n",
+ "# Test set\n",
+ "tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')\n",
+ "tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')\n",
+ "tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:07.701862Z",
+ "start_time": "2020-11-18T04:45:07.644335Z"
+ }
+ },
+ "outputs": [],
+ "source": [
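+ "# Concatenate the features output by the individual models\n",
+ "# NOTE: the columns below are attached by row index, so every file must list the same (user, article) rows in the same order\n",
+ "finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']].copy()\n",
+ "finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']].copy()\n",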
+ "\n",
+ "for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):\n",
+ " for feat in [ 'pred_score', 'pred_rank']:\n",
+ " col_name = feat + '_' + str(idx)\n",
+ " finall_trn_ranker_feats[col_name] = trn_model[feat]\n",
+ "\n",
+ "for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):\n",
+ " for feat in [ 'pred_score', 'pred_rank']:\n",
+ " col_name = feat + '_' + str(idx)\n",
+ " finall_tst_ranker_feats[col_name] = tst_model[feat]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:15.044242Z",
+ "start_time": "2020-11-18T04:45:13.138252Z"
+ }
+ },
+ "outputs": [],
+ "source": [
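+ "# Fit a logistic regression on the features produced by cross-validation, then use it to predict on the test set\n",
+ "# Note: during cross-validation, extra features related to the predicted scores can be constructed to enrich the inputs of this simple model\n",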
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']\n",
+ "\n",
+ "trn_x = finall_trn_ranker_feats[feat_cols]\n",
+ "trn_y = finall_trn_ranker_feats['label']\n",
+ "\n",
+ "tst_x = finall_tst_ranker_feats[feat_cols]\n",
+ "\n",
+ "# Define the model\n",
+ "lr = LogisticRegression()\n",
+ "\n",
+ "# Train the model\n",
+ "lr.fit(trn_x, trn_y)\n",
+ "\n",
+ "# Predict on the test set\n",
+ "finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:29.018764Z",
+ "start_time": "2020-11-18T04:45:19.423130Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Re-rank by the predicted score and generate the submission file\n",
+ "rank_results = finall_tst_ranker_feats[['user_id', 'click_article_id', 'pred_score']]\n",
+ "submit(rank_results, topk=5, model_name='ensumble_staking')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
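+ "# Summary\n",
+ "This chapter covered three ranking models: the LightGBM ranker, the LightGBM classifier, and the deep-learning DIN model. We did not explain the principles behind these models in detail; readers are encouraged to explore them on their own and to share what they learn so that we can all improve together. Finally, we applied two simple model-ensembling strategies: simple weighted fusion and stacking.\n",
+ "\n",
+ "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, and the courage to experiment and take responsibility. Datawhale also applies open-source principles to open content, open learning, and open solutions, empowering talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data-mining learning path will be shared on Tianchi; follow Datawhale for details:\n",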
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = finall_tst_ranker_feats[['user_id', 'click_article_id', 'pred_score']]\n",
- "submit(rank_results, topk=5, model_name='ensumble_staking')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 总结\n",
- "本章主要学习了三个排序模型,包括LGB的Rank, LGB的Classifier还有深度学习的DIN模型, 当然,对于这三个模型的原理部分,我们并没有给出详细的介绍, 请大家课下自己探索原理,也欢迎大家把自己的探索与所学分享出来,我们一块学习和进步。最后,我们进行了简单的模型融合策略,包括简单的加权和Stacking。\n",
- "\n",
- "关于Datawhale: Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.8"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "170px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
},
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "170px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
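
Note on the weighted-fusion cell above: `get_ensumble_predict_topk` sums the per-model scores, so every model effectively gets weight 1. The following is a minimal sketch of what an explicitly weighted variant could look like. The helper names `min_max_norm` and `weighted_fuse`, the weights, and the toy data are all hypothetical; the notebook's own `norm_sim` and `submit` helpers are not shown in this diff, so min-max scaling is only an assumed stand-in for `norm_sim`.

```python
import pandas as pd


def min_max_norm(scores: pd.Series) -> pd.Series:
    """Scale scores to [0, 1]; a constant column maps to all zeros."""
    span = scores.max() - scores.min()
    if span == 0:
        return pd.Series(0.0, index=scores.index)
    return (scores - scores.min()) / span


def weighted_fuse(rank_models: dict, weights: dict) -> pd.DataFrame:
    """Normalise each model's scores, scale them by the model weight, and sum per pair."""
    parts = []
    for name, df in rank_models.items():
        part = df[['user_id', 'click_article_id', 'pred_score']].copy()
        part['pred_score'] = min_max_norm(part['pred_score']) * weights[name]
        parts.append(part)
    fused = pd.concat(parts, ignore_index=True)
    # One row per (user, article) pair, holding the weighted sum of scores.
    return fused.groupby(['user_id', 'click_article_id'], as_index=False)['pred_score'].sum()


# Toy data (hypothetical): two models scoring the same two candidates on different scales.
model_a = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11], 'pred_score': [0.2, 0.8]})
model_b = pd.DataFrame({'user_id': [1, 1], 'click_article_id': [10, 11], 'pred_score': [5.0, 1.0]})

fused = weighted_fuse({'a': model_a, 'b': model_b}, {'a': 0.7, 'b': 0.3})
print(fused)
```

Normalising every model before weighting keeps a model with a wide raw score range from dominating the sum; in practice the weights would probably be tuned on an offline validation metric rather than fixed by hand.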
diff --git "a/docs/ch03/ch3.1/jupyter/\346\225\260\346\215\256\345\210\206\346\236\220.ipynb" "b/docs/ch03/ch3.1/jupyter/\346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
index c9cbc0c37..6bc2d7d2b 100644
--- "a/docs/ch03/ch3.1/jupyter/\346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
+++ "b/docs/ch03/ch3.1/jupyter/\346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
@@ -1,3980 +1,3980 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 数据分析\n",
- "\n",
- "数据分析的价值主要在于熟悉了解整个数据集的基本情况包括每个文件里有哪些数据,具体的文件中的每个字段表示什么实际含义,以及数据集中特征之间的相关性,在推荐场景下主要就是分析用户本身的基本属性,文章基本属性,以及用户和文章交互的一些分布,这些都有利于后面的召回策略的选择,以及特征工程。\n",
- "\n",
- "**建议:当特征工程和模型调参已经很难继续上分了,可以回来在重新从新的角度去分析这些数据,或许可以找到上分的灵感**\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 导包"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:13:59.322486Z",
- "start_time": "2020-11-13T15:13:55.601445Z"
- }
- },
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "import pandas as pd\n",
- "import numpy as np\n",
- "\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "plt.rc('font', family='SimHei', size=13)\n",
- "\n",
- "import os,gc,re,warnings,sys\n",
- "warnings.filterwarnings(\"ignore\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取数据"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:14:18.918041Z",
- "start_time": "2020-11-13T15:14:02.568798Z"
- }
- },
- "outputs": [],
- "source": [
- "# path = './data/' # 自定义的路径\n",
- "path = './' # 天池平台路径\n",
- "\n",
- "#####train\n",
- "trn_click = pd.read_csv(path+'train_click_log.csv')\n",
- "#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])\n",
- "item_df = pd.read_csv(path+'articles.csv')\n",
- "item_df = item_df.rename(columns={'article_id': 'click_article_id'}) #重命名,方便后续match\n",
- "item_emb_df = pd.read_csv(path+'articles_emb.csv')\n",
- "\n",
- "#####test\n",
- "tst_click = pd.read_csv(path+'testA_click_log.csv')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 数据预处理\n",
- "计算用户点击rank和点击次数"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:14:31.746748Z",
- "start_time": "2020-11-13T15:14:31.409643Z"
- }
- },
- "outputs": [],
- "source": [
- "# 对每个用户的点击时间戳进行排序\n",
- "trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)\n",
- "tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:15:04.503079Z",
- "start_time": "2020-11-13T15:15:04.394329Z"
- }
- },
- "outputs": [],
- "source": [
- "#计算用户点击文章的次数,并添加新的一列count\n",
- "trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')\n",
- "tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 数据浏览"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 用户点击日志文件_训练集"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:16:07.764776Z",
- "start_time": "2020-11-13T15:16:07.536342Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr><th>0</th><td>199999</td><td>160417</td><td>1507029570190</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>11</td><td>11</td><td>281</td><td>1506942089000</td><td>173</td></tr>\n",
- " <tr><th>1</th><td>199999</td><td>5408</td><td>1507029571478</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>10</td><td>11</td><td>4</td><td>1506994257000</td><td>118</td></tr>\n",
- " <tr><th>2</th><td>199999</td><td>50823</td><td>1507029601478</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>9</td><td>11</td><td>99</td><td>1507013614000</td><td>213</td></tr>\n",
- " <tr><th>3</th><td>199998</td><td>157770</td><td>1507029532200</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>5</td><td>40</td><td>40</td><td>281</td><td>1506983935000</td><td>201</td></tr>\n",
- " <tr><th>4</th><td>199998</td><td>96613</td><td>1507029671831</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>5</td><td>39</td><td>40</td><td>209</td><td>1506938444000</td><td>185</td></tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
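+ "# Data Analysis\n",
+ "\n",
+ "The main value of data analysis is to become familiar with the dataset as a whole: which data each file contains, what each field actually means, and how the features relate to one another. In a recommendation setting this mostly means analysing the basic attributes of users, the basic attributes of articles, and the distributions of user-article interactions. All of this helps when choosing recall strategies and doing feature engineering later on.\n",
+ "\n",
+ "**Tip: when feature engineering and hyperparameter tuning no longer improve the score, come back and re-examine the data from a new angle; it may suggest fresh ideas for improving the score.**\n"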
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:13:59.322486Z",
+ "start_time": "2020-11-13T15:13:55.601445Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%matplotlib inline\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "plt.rc('font', family='SimHei', size=13)\n",
+ "\n",
+ "import os,gc,re,warnings,sys\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading the Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:14:18.918041Z",
+ "start_time": "2020-11-13T15:14:02.568798Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# path = './data/' # custom path\n",
+ "path = './' # path on the Tianchi platform\n",
+ "\n",
+ "#####train\n",
+ "trn_click = pd.read_csv(path+'train_click_log.csv')\n",
+ "#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])\n",
+ "item_df = pd.read_csv(path+'articles.csv')\n",
+ "item_df = item_df.rename(columns={'article_id': 'click_article_id'}) # rename so that later merges can match on this key\n",
+ "item_emb_df = pd.read_csv(path+'articles_emb.csv')\n",
+ "\n",
+ "#####test\n",
+ "tst_click = pd.read_csv(path+'testA_click_log.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Preprocessing\n",
+ "Compute each user's click rank and click count"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:14:31.746748Z",
+ "start_time": "2020-11-13T15:14:31.409643Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Rank each user's clicks by timestamp in descending order (rank 1 is the most recent click)\n",
+ "trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)\n",
+ "tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:15:04.503079Z",
+ "start_time": "2020-11-13T15:15:04.394329Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Count how many articles each user clicked and store it in a new column, click_cnts\n",
+ "trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')\n",
+ "tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Exploring the Data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### User Click Log: Training Set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:16:07.764776Z",
+ "start_time": "2020-11-13T15:16:07.536342Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>0</th><td>199999</td><td>160417</td><td>1507029570190</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>11</td><td>11</td><td>281</td><td>1506942089000</td><td>173</td></tr>\n",
+ " <tr><th>1</th><td>199999</td><td>5408</td><td>1507029571478</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>10</td><td>11</td><td>4</td><td>1506994257000</td><td>118</td></tr>\n",
+ " <tr><th>2</th><td>199999</td><td>50823</td><td>1507029601478</td><td>4</td><td>1</td><td>17</td><td>1</td><td>13</td><td>1</td><td>9</td><td>11</td><td>99</td><td>1507013614000</td><td>213</td></tr>\n",
+ " <tr><th>3</th><td>199998</td><td>157770</td><td>1507029532200</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>5</td><td>40</td><td>40</td><td>281</td><td>1506983935000</td><td>201</td></tr>\n",
+ " <tr><th>4</th><td>199998</td><td>96613</td><td>1507029671831</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>5</td><td>39</td><td>40</td><td>209</td><td>1506938444000</td><td>185</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "0 199999 160417 1507029570190 4 \n",
+ "1 199999 5408 1507029571478 4 \n",
+ "2 199999 50823 1507029601478 4 \n",
+ "3 199998 157770 1507029532200 4 \n",
+ "4 199998 96613 1507029671831 4 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "0 1 17 1 13 \n",
+ "1 1 17 1 13 \n",
+ "2 1 17 1 13 \n",
+ "3 1 17 1 25 \n",
+ "4 1 17 1 25 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
+ "0 1 11 11 281 1506942089000 \n",
+ "1 1 10 11 4 1506994257000 \n",
+ "2 1 9 11 99 1507013614000 \n",
+ "3 5 40 40 281 1506983935000 \n",
+ "4 5 39 40 209 1506938444000 \n",
+ "\n",
+ " words_count \n",
+ "0 173 \n",
+ "1 118 \n",
+ "2 213 \n",
+ "3 201 \n",
+ "4 185 "
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "0 199999 160417 1507029570190 4 \n",
- "1 199999 5408 1507029571478 4 \n",
- "2 199999 50823 1507029601478 4 \n",
- "3 199998 157770 1507029532200 4 \n",
- "4 199998 96613 1507029671831 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "0 1 17 1 13 \n",
- "1 1 17 1 13 \n",
- "2 1 17 1 13 \n",
- "3 1 17 1 25 \n",
- "4 1 17 1 25 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "0 1 11 11 281 1506942089000 \n",
- "1 1 10 11 4 1506994257000 \n",
- "2 1 9 11 99 1507013614000 \n",
- "3 5 40 40 281 1506983935000 \n",
- "4 5 39 40 209 1506938444000 \n",
- "\n",
- " words_count \n",
- "0 173 \n",
- "1 118 \n",
- "2 213 \n",
- "3 201 \n",
- "4 185 "
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])\n",
- "trn_click.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### train_click_log.csv文件数据中每个字段的含义\n",
- "\n",
- "1. user_id: 用户的唯一标识\n",
- "2. click_article_id: 用户点击的文章唯一标识\n",
- "3. click_timestamp: 用户点击文章时的时间戳\n",
- "4. click_environment: 用户点击文章的环境\n",
- "5. click_deviceGroup: 用户点击文章的设备组\n",
- "6. click_os: 用户点击文章时的操作系统\n",
- "7. click_country: 用户点击文章时的所在的国家\n",
- "8. click_region: 用户点击文章时所在的区域\n",
- "9. click_referrer_type: 用户点击文章时,文章的来源"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:16:18.536902Z",
- "start_time": "2020-11-13T15:16:18.424203Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "<class 'pandas.core.frame.DataFrame'>\n",
- "Int64Index: 1112623 entries, 0 to 1112622\n",
- "Data columns (total 14 columns):\n",
- "user_id 1112623 non-null int64\n",
- "click_article_id 1112623 non-null int64\n",
- "click_timestamp 1112623 non-null int64\n",
- "click_environment 1112623 non-null int64\n",
- "click_deviceGroup 1112623 non-null int64\n",
- "click_os 1112623 non-null int64\n",
- "click_country 1112623 non-null int64\n",
- "click_region 1112623 non-null int64\n",
- "click_referrer_type 1112623 non-null int64\n",
- "rank 1112623 non-null int64\n",
- "click_cnts 1112623 non-null int64\n",
- "category_id 1112623 non-null int64\n",
- "created_at_ts 1112623 non-null int64\n",
- "words_count 1112623 non-null int64\n",
- "dtypes: int64(14)\n",
- "memory usage: 127.3 MB\n"
- ]
- }
- ],
- "source": [
- "#用户点击日志信息\n",
- "trn_click.info()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- " <thead>\n",
- " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
- " </thead>\n",
- " <tbody>\n",
- " <tr><th>count</th><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td></tr>\n",
- " <tr><th>mean</th><td>1.221198e+05</td><td>1.951541e+05</td><td>1.507588e+12</td><td>3.947786e+00</td><td>1.815981e+00</td><td>1.301976e+01</td><td>1.310776e+00</td><td>1.813587e+01</td><td>1.910063e+00</td><td>7.118518e+00</td><td>1.323704e+01</td><td>3.056176e+02</td><td>1.506598e+12</td><td>2.011981e+02</td></tr>\n",
- " <tr><th>std</th><td>5.540349e+04</td><td>9.292286e+04</td><td>3.363466e+08</td><td>3.276715e-01</td><td>1.035170e+00</td><td>6.967844e+00</td><td>1.618264e+00</td><td>7.105832e+00</td><td>1.220012e+00</td><td>1.016095e+01</td><td>1.631503e+01</td><td>1.155791e+02</td><td>8.343066e+09</td><td>5.223881e+01</td></tr>\n",
- " <tr><th>min</th><td>0.000000e+00</td><td>3.000000e+00</td><td>1.507030e+12</td><td>1.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.166573e+12</td><td>0.000000e+00</td></tr>\n",
- " <tr><th>25%</th><td>7.934700e+04</td><td>1.239090e+05</td><td>1.507297e+12</td><td>4.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.300000e+01</td><td>1.000000e+00</td><td>2.000000e+00</td><td>4.000000e+00</td><td>2.500000e+02</td><td>1.507220e+12</td><td>1.700000e+02</td></tr>\n",
- " <tr><th>50%</th><td>1.309670e+05</td><td>2.038900e+05</td><td>1.507596e+12</td><td>4.000000e+00</td><td>1.000000e+00</td><td>1.700000e+01</td><td>1.000000e+00</td><td>2.100000e+01</td><td>2.000000e+00</td><td>4.000000e+00</td><td>8.000000e+00</td><td>3.280000e+02</td><td>1.507553e+12</td><td>1.970000e+02</td></tr>\n",
- " <tr><th>75%</th><td>1.704010e+05</td><td>2.777120e+05</td><td>1.507841e+12</td><td>4.000000e+00</td><td>3.000000e+00</td><td>1.700000e+01</td><td>1.000000e+00</td><td>2.500000e+01</td><td>2.000000e+00</td><td>8.000000e+00</td><td>1.600000e+01</td><td>4.100000e+02</td><td>1.507756e+12</td><td>2.280000e+02</td></tr>\n",
- " <tr><th>max</th><td>1.999990e+05</td><td>3.640460e+05</td><td>1.510603e+12</td><td>4.000000e+00</td><td>5.000000e+00</td><td>2.000000e+01</td><td>1.100000e+01</td><td>2.800000e+01</td><td>7.000000e+00</td><td>2.410000e+02</td><td>2.410000e+02</td><td>4.600000e+02</td><td>1.510666e+12</td><td>6.690000e+03</td></tr>\n",
- " </tbody>\n",
- "</table>\n",
- "</div>"
+ "source": [
+ "trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])\n",
+ "trn_click.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
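+ "#### Meaning of each field in train_click_log.csv\n",
+ "\n",
+ "1. user_id: unique identifier of the user\n",
+ "2. click_article_id: unique identifier of the clicked article\n",
+ "3. click_timestamp: timestamp of the click\n",
+ "4. click_environment: environment in which the click happened\n",
+ "5. click_deviceGroup: device group used for the click\n",
+ "6. click_os: operating system at the time of the click\n",
+ "7. click_country: country the user was in when clicking\n",
+ "8. click_region: region the user was in when clicking\n",
+ "9. click_referrer_type: referrer type, i.e. the source through which the user reached the article"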
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:16:18.536902Z",
+ "start_time": "2020-11-13T15:16:18.424203Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "<class 'pandas.core.frame.DataFrame'>\n",
+ "Int64Index: 1112623 entries, 0 to 1112622\n",
+ "Data columns (total 14 columns):\n",
+ "user_id 1112623 non-null int64\n",
+ "click_article_id 1112623 non-null int64\n",
+ "click_timestamp 1112623 non-null int64\n",
+ "click_environment 1112623 non-null int64\n",
+ "click_deviceGroup 1112623 non-null int64\n",
+ "click_os 1112623 non-null int64\n",
+ "click_country 1112623 non-null int64\n",
+ "click_region 1112623 non-null int64\n",
+ "click_referrer_type 1112623 non-null int64\n",
+ "rank 1112623 non-null int64\n",
+ "click_cnts 1112623 non-null int64\n",
+ "category_id 1112623 non-null int64\n",
+ "created_at_ts 1112623 non-null int64\n",
+ "words_count 1112623 non-null int64\n",
+ "dtypes: int64(14)\n",
+ "memory usage: 127.3 MB\n"
+ ]
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.221198e+05 1.951541e+05 1.507588e+12 3.947786e+00 \n",
- "std 5.540349e+04 9.292286e+04 3.363466e+08 3.276715e-01 \n",
- "min 0.000000e+00 3.000000e+00 1.507030e+12 1.000000e+00 \n",
- "25% 7.934700e+04 1.239090e+05 1.507297e+12 4.000000e+00 \n",
- "50% 1.309670e+05 2.038900e+05 1.507596e+12 4.000000e+00 \n",
- "75% 1.704010e+05 2.777120e+05 1.507841e+12 4.000000e+00 \n",
- "max 1.999990e+05 3.640460e+05 1.510603e+12 4.000000e+00 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.815981e+00 1.301976e+01 1.310776e+00 1.813587e+01 \n",
- "std 1.035170e+00 6.967844e+00 1.618264e+00 7.105832e+00 \n",
- "min 1.000000e+00 2.000000e+00 1.000000e+00 1.000000e+00 \n",
- "25% 1.000000e+00 2.000000e+00 1.000000e+00 1.300000e+01 \n",
- "50% 1.000000e+00 1.700000e+01 1.000000e+00 2.100000e+01 \n",
- "75% 3.000000e+00 1.700000e+01 1.000000e+00 2.500000e+01 \n",
- "max 5.000000e+00 2.000000e+01 1.100000e+01 2.800000e+01 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.910063e+00 7.118518e+00 1.323704e+01 3.056176e+02 \n",
- "std 1.220012e+00 1.016095e+01 1.631503e+01 1.155791e+02 \n",
- "min 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+00 \n",
- "25% 1.000000e+00 2.000000e+00 4.000000e+00 2.500000e+02 \n",
- "50% 2.000000e+00 4.000000e+00 8.000000e+00 3.280000e+02 \n",
- "75% 2.000000e+00 8.000000e+00 1.600000e+01 4.100000e+02 \n",
- "max 7.000000e+00 2.410000e+02 2.410000e+02 4.600000e+02 \n",
- "\n",
- " created_at_ts words_count \n",
- "count 1.112623e+06 1.112623e+06 \n",
- "mean 1.506598e+12 2.011981e+02 \n",
- "std 8.343066e+09 5.223881e+01 \n",
- "min 1.166573e+12 0.000000e+00 \n",
- "25% 1.507220e+12 1.700000e+02 \n",
- "50% 1.507553e+12 1.970000e+02 \n",
- "75% 1.507756e+12 2.280000e+02 \n",
- "max 1.510666e+12 6.690000e+03 "
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click.describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "200000"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#训练集中的用户数量为20w\n",
- "trn_click.user_id.nunique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T16:03:01.378461Z",
- "start_time": "2020-11-13T16:03:01.300712Z"
- }
- },
- "outputs": [
+ "source": [
+ "# Summary information for the user click log\n",
+ "trn_click.info()"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "2"
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr><th>count</th><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td><td>1.112623e+06</td></tr>\n",
+ " <tr><th>mean</th><td>1.221198e+05</td><td>1.951541e+05</td><td>1.507588e+12</td><td>3.947786e+00</td><td>1.815981e+00</td><td>1.301976e+01</td><td>1.310776e+00</td><td>1.813587e+01</td><td>1.910063e+00</td><td>7.118518e+00</td><td>1.323704e+01</td><td>3.056176e+02</td><td>1.506598e+12</td><td>2.011981e+02</td></tr>\n",
+ " <tr><th>std</th><td>5.540349e+04</td><td>9.292286e+04</td><td>3.363466e+08</td><td>3.276715e-01</td><td>1.035170e+00</td><td>6.967844e+00</td><td>1.618264e+00</td><td>7.105832e+00</td><td>1.220012e+00</td><td>1.016095e+01</td><td>1.631503e+01</td><td>1.155791e+02</td><td>8.343066e+09</td><td>5.223881e+01</td></tr>\n",
+ " <tr><th>min</th><td>0.000000e+00</td><td>3.000000e+00</td><td>1.507030e+12</td><td>1.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.166573e+12</td><td>0.000000e+00</td></tr>\n",
+ " <tr><th>25%</th><td>7.934700e+04</td><td>1.239090e+05</td><td>1.507297e+12</td><td>4.000000e+00</td><td>1.000000e+00</td><td>2.000000e+00</td><td>1.000000e+00</td><td>1.300000e+01</td><td>1.000000e+00</td><td>2.000000e+00</td><td>4.000000e+00</td><td>2.500000e+02</td><td>1.507220e+12</td><td>1.700000e+02</td></tr>\n",
+ " <tr><th>50%</th><td>1.309670e+05</td><td>2.038900e+05</td><td>1.507596e+12</td><td>4.000000e+00</td><td>1.000000e+00</td><td>1.700000e+01</td><td>1.000000e+00</td><td>2.100000e+01</td><td>2.000000e+00</td><td>4.000000e+00</td><td>8.000000e+00</td><td>3.280000e+02</td><td>1.507553e+12</td><td>1.970000e+02</td></tr>\n",
+ " <tr><th>75%</th><td>1.704010e+05</td><td>2.777120e+05</td><td>1.507841e+12</td><td>4.000000e+00</td><td>3.000000e+00</td><td>1.700000e+01</td><td>1.000000e+00</td><td>2.500000e+01</td><td>2.000000e+00</td><td>8.000000e+00</td><td>1.600000e+01</td><td>4.100000e+02</td><td>1.507756e+12</td><td>2.280000e+02</td></tr>\n",
+ " <tr><th>max</th><td>1.999990e+05</td><td>3.640460e+05</td><td>1.510603e+12</td><td>4.000000e+00</td><td>5.000000e+00</td><td>2.000000e+01</td><td>1.100000e+01</td><td>2.800000e+01</td><td>7.000000e+00</td><td>2.410000e+02</td><td>2.410000e+02</td><td>4.600000e+02</td><td>1.510666e+12</td><td>6.690000e+03</td></tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.221198e+05 1.951541e+05 1.507588e+12 3.947786e+00 \n",
+ "std 5.540349e+04 9.292286e+04 3.363466e+08 3.276715e-01 \n",
+ "min 0.000000e+00 3.000000e+00 1.507030e+12 1.000000e+00 \n",
+ "25% 7.934700e+04 1.239090e+05 1.507297e+12 4.000000e+00 \n",
+ "50% 1.309670e+05 2.038900e+05 1.507596e+12 4.000000e+00 \n",
+ "75% 1.704010e+05 2.777120e+05 1.507841e+12 4.000000e+00 \n",
+ "max 1.999990e+05 3.640460e+05 1.510603e+12 4.000000e+00 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.815981e+00 1.301976e+01 1.310776e+00 1.813587e+01 \n",
+ "std 1.035170e+00 6.967844e+00 1.618264e+00 7.105832e+00 \n",
+ "min 1.000000e+00 2.000000e+00 1.000000e+00 1.000000e+00 \n",
+ "25% 1.000000e+00 2.000000e+00 1.000000e+00 1.300000e+01 \n",
+ "50% 1.000000e+00 1.700000e+01 1.000000e+00 2.100000e+01 \n",
+ "75% 3.000000e+00 1.700000e+01 1.000000e+00 2.500000e+01 \n",
+ "max 5.000000e+00 2.000000e+01 1.100000e+01 2.800000e+01 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.910063e+00 7.118518e+00 1.323704e+01 3.056176e+02 \n",
+ "std 1.220012e+00 1.016095e+01 1.631503e+01 1.155791e+02 \n",
+ "min 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+00 \n",
+ "25% 1.000000e+00 2.000000e+00 4.000000e+00 2.500000e+02 \n",
+ "50% 2.000000e+00 4.000000e+00 8.000000e+00 3.280000e+02 \n",
+ "75% 2.000000e+00 8.000000e+00 1.600000e+01 4.100000e+02 \n",
+ "max 7.000000e+00 2.410000e+02 2.410000e+02 4.600000e+02 \n",
+ "\n",
+ " created_at_ts words_count \n",
+ "count 1.112623e+06 1.112623e+06 \n",
+ "mean 1.506598e+12 2.011981e+02 \n",
+ "std 8.343066e+09 5.223881e+01 \n",
+ "min 1.166573e+12 0.000000e+00 \n",
+ "25% 1.507220e+12 1.700000e+02 \n",
+ "50% 1.507553e+12 1.970000e+02 \n",
+ "75% 1.507756e+12 2.280000e+02 \n",
+ "max 1.510666e+12 6.690000e+03 "
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_click.describe()"
]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click.groupby('user_id')['click_article_id'].count().min() # every user in the training set clicked at least two articles"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "##### Plot histograms to get a rough view of the basic attribute distributions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n",
- "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAABDAAAAWYCAYAAABArDYhAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAEAAElEQVR4nOzdd5gkZd318e8hIzmJSHBREEUUkRVQfBRBSQoYEMGEiGIC9TGC+Aqi+GAWFVAUBBSJoqxKEAVEVMKCZEQRCYuEJSdBwnn/uO9he4dJuzPbVbV7PtfV11ZXVVef6e2Zrv7VHWSbiIiIiIiIiIg2m6/pABERERERERERo0kBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiYEJLeLencnvsPSHr2KI+ZJMmSFhjnc79d0m/Hc4yIaLcUMCLiKZo8+Zgdkq6UtMkY9rOkNcbxPP8j6ZoRth8h6Uuze/yIiIi5je3FbV830ccd6rzD9tG2N5/o5xolx76SftrP54yYl/X9i0ZEdI/txZvOMEDSEcA0258bWGf7Bf14btt/BNbqx3NFRERERMTM0gIjIjpD0vxNZ4iIiIhC0qqSTpI0XdKdkr43xD5Ptn6UtKikb0i6QdK9ks6VtOgQj3mzpOslrTPC059T/72nthR92RAtSC3pQ5L+Iel+SV+U9BxJf5Z0n6TjJS3Us//rJV0i6Z66z4t6tn1G0s31ONdI2kzSlsBngbfWDJfWfXeRdHXd9zpJ7+85ziaSpkn6tKTbJd0i6Q2Stpb0d0l3Sfpsz/77SjpR0nH1eBdLWncs/z8Rc6MUMCLmcQ2ffCDpBEm31mOdI+kFPduOkHSIpFMkPQjsCrwd+HQ9UfhV3e96Sa+py/NL+qykf9YP+oskrTrE8y4s6euSbpR0m6TvD/VzDHrMJpKm9dxfr55I3C/pOGCRkR4fERExt6gXFX4N3ABMAlYGjh3lYV8H1gdeDiwLfBp4YtBxdwG+ArzG9hUjHOuV9d+lazeVvwyz3xb1OTeqz3co8A5gVWAdYKf6vOsBhwPvB5YDfgBMqecLawG7Ay+1vUQ95vW2TwO+DBxXMwwUFm4HXg8sCewCfEvSS3oyPYNyzrAy8HnghzXT+sD/AP9P0uo9+28HnFBfs58Bv5S04AivTcRcKwWMiHlYC04+AE4F1gSeDlwMHD1o+9uA/YElgKPq9q/WE4VthjjexyknI1tTThzeAzw0xH4HAM8FXgyswYyTiDGpV2x+CfyE8jqcALx5rI+PiIjouA2AZwKfsv2g7YdtnzvczpLmo3wmf9T2zbYft/1n24/07PYx4FPAJravnaCcX7V9n+0rgSuA39q+zva9lHOQ9ep+uwE/sH1+zXYk8Ail8PE4sDCwtqQFbV9v+5/DPaHt39j+p4s/AL+lFCYGPArsb/tRynnX8sCBtu+vOa8CeltZXGT7xLr/NynFj43G+8JEdFEKGBHztsZPPmwfXj+wHwH2BdaVtFTPLifb/pPtJ2w/PIaf6b3A52xfU08cLrV956CfQ5QTlf+1fZft+ylXUHYcw/EHbAQsCHzb9qO2TwQunIXHR0REdNmqwA22Hxvj/stTvngP+8Wfcv5wkO1pI+wzq27rWf7PEPcHxvl6FvCJ2n3kHkn3UH7GZ9bzmY9RzlNul3SspGcO94SStpJ0Xu0Ocg/losryPbvcafvxngxD5ewdf+ymgQXbTwDTKOdvEfOcFDAi5m2NnnzU7h4H1O4e9wHX9zzPgJue+sgRrTpKPoAVgKcBF/WcpJxW14/VM4Gbbbtn3Q2zEjQiIqLDbgJW09hnH7sDeBh4zgj7bA58TtJYWjR69F1myU2UVhFL99yeZvsYANs/s/0KSqHDlJamT8khaWHg55QWqyvaXho4BdA4sj3ZFbZeTFoF+Pc4jhfRWSlgRMzbmj75eBulX+dr
gKUo3Vhg5g/5wScoo52w3DRKPig/x3+AF/ScpCw1i7Ot3AKsXFtzDFhtFh4fERHRZRdQPgsPkLSYpEUkbTzczrXlwOHANyU9s17EeFn9wj/gSmBL4CBJ247y/NMpXVhHnOZ9FvwQ+ICkDVUsJul1kpaQtJakTWvWhynnEAPdZ28DJtXCAsBClO4m04HHJG1FOTcaj/Ulvamer32M0rXlvHEeM6KTUsCImLc1ffKxBOVD+E5Ki4gvjyHzbYx8svIj4IuS1qwnIC+StNwQP8cPKYNqPR1A0sqSthjD8w/4C/AY8BFJC0p6E6VLTkRExFyvdoHYhjKO1I2Ubg1vHeVhnwQup3S5vIvSimGm7yO2L6UMgPnD+uV/uOd/iDJG1p9qa8pxjQlheyrwPuB7wN3AtcC76+aFKWNn3QHcShm3a6+67YT6752SLq7dUj8CHF+P8zZgyniyASdTXtu7gXcCb6rjYUTMczRz6+eImNdIWg34DmVwKVNGt74YeG9tKokkA2vavrbO1PF/wFso/TMvpYzGvSLwL2BB249Jmgz8Bni37VOHee7FKYNybko5kfl/wJE9z3UEMM3253oesyblZGEScLbtN0i6vub9XR2YdC/KjCXLA38D3mh72qCfYxHKoJ071v1uBg6x/Z0RXqtNgJ/aXqXen0wphKxBaR4K8I/evBERERGzS9K+wBq239F0log2SAEjIiIiIiKihVLAiJhZupBERERERETrSHq7pAeGuF3ZdLaIaEZaYETEHCXp7cAPhth0g+0X9DvPaCR9FvjsEJv+aHvYvrgRERERETFnpYAREREREREREa031qkT5wnLL7+8J02a1HSMiIiITrnooovusL1C0znaJucVERERs2e4c4sUMHpMmjSJqVOnNh0jIiKiUyTd0HSGNsp5RURExOwZ7twig3hGREREREREROulgBERERERERERrddIAUPS4ZJul3RFz7qvSfqbpMsk/ULS0j3b9pJ0raRrJG3Rs37Luu5aSXv2rF9d0vl1/XGSFurbDxcRERERERERE66pFhhHAFsOWncGsI7tFwF/B/YCkLQ2sCPwgvqYgyXNL2l+4CBgK2BtYKe6L8BXgG/ZXgO4G9h1zv44ERER0VWSFpF0gaRLJV0p6QtD7LNwvShybb1IMqmBqBEREfO0RgoYts8B7hq07re2H6t3zwNWqcvbAcfafsT2v4BrgQ3q7Vrb19n+L3AssJ0kAZsCJ9bHHwm8YU7+PBEREdFpjwCb2l4XeDGwpaSNBu2zK3B3vTjyLcrFkoiIiOijto6B8R7g1Lq8MnBTz7Zpdd1w65cD7ukphgysH5Kk3SRNlTR1+vTpExQ/IiIiusLFA/XugvXmQbttR7koAuUiyWb1oklERET0SeumUZW0N/AYcHQ/ns/2ocChAJMnTx58shIRLbb/O7ZvOsJT7P3TE0ffKSJap3ZNvQhYAzjI9vmDdnnywontxyTdS7locseg4+wG7Aaw2mqrzenYMRf7wytf1XSEp3jVOX9oOkJEjNO6J57edISnuHT7LUbfqWpVCwxJ7wZeD7zd9kAx4WZg1Z7dVqnrhlt/J7C0pAUGrY+IiIgYku3Hbb+Yct6wgaR1ZvM4h9qebHvyCiusMKEZIyIi5nWtKWBI2hL4NLCt7Yd6Nk0BdqyDZ60OrAlcAFwIrFlnHFmIMtDnlFr4OAsYuDS7M3Byv36OiIiI6C7b91DOIwYPNv7khZN6kWQpykWTiIiI6JOmplE9BvgLsJakaZJ2Bb4HLAGcIekSSd8HsH0lcDxwFXAa8OF6leQxYHfgdOBq4Pi6L8BngI9LupbSvPOwPv54ERER0SGSVhiYvl3SosBrgb8N2m0K5aIIlIskZ/a0Fo2IiIg+aGQMDNs7DbF62CKD7f2B/YdYfwpwyhDrr6PMUhIRERExmpWAI+s4GPNRLor8WtJ+wFTbUyjnKT+pF0fuorT8jIiIiD5q3SCeEREREf1k+zJgvSHWf75n
+WHgLf3MFRERETNrzRgYERERERERERHDSQuMudSN+72w6QhPsdrnL286QkRERERERHRUWmBEREREREREROulBUZERIzJvvvu23SEIbU1V0RERERMrLTAiIiIiIiIiIjWSwuMEaz/qaOajjCki772rqYjRERERERERPRVWmBEREREREREROulgBERERERERERrZcCRkRERERERES0XgoYEREREREREdF6KWBEREREREREROulgBERERERERERrZcCRkRERERERES03gJNB4iIZn3vE79qOsJT7P6NbZqOEBERERERLZMWGBERERERERHReilgRERERERERETrpYAREREREREREa3XWAFD0uGSbpd0Rc+6ZSWdIekf9d9l6npJ+o6kayVdJuklPY/Zue7/D0k796xfX9Ll9THfkaT+/oQRERERERERMVGaHMTzCOB7wFE96/YEfm/7AEl71vufAbYC1qy3DYFDgA0lLQvsA0wGDFwkaYrtu+s+7wPOB04BtgRO7cPPFRERLXP8CRs0HeEpdnjLBU1HiErSqpTzkRUp5xOH2j5w0D6bACcD/6qrTrK9Xx9jRkREzPMaa4Fh+xzgrkGrtwOOrMtHAm/oWX+Ui/OApSWtBGwBnGH7rlq0OAPYsm5b0vZ5tk05KXkDEREREU/1GPAJ22sDGwEflrT2EPv90faL6y3Fi4iIiD5r2xgYK9q+pS7fSrkSArAycFPPftPqupHWTxti/VNI2k3SVElTp0+fPv6fICIiIjrF9i22L67L9wNXM8x5Q0RERDSnbQWMJ9WWE+7D8xxqe7LtySussMKcfrqIiIhoMUmTgPUoXVAHe5mkSyWdKukFwzw+F0YiIiLmkCbHwBjKbZJWsn1L7QZye11/M7Bqz36r1HU3A5sMWn92Xb/KEPtHzDF/eOWrmo7wFK865w9NR4iI6AxJiwM/Bz5m+75Bmy8GnmX7AUlbA7+kjM01E9uHAocCTJ48eY5fiImIiJiXtK0FxhRgYCaRnSmDZQ2sf1edjWQj4N7a1eR0YHNJy9QZSzYHTq/b7pO0UZ195F09x4qIiIiYiaQFKcWLo22fNHi77ftsP1CXTwEWlLR8n2NGRETM0xprgSHpGErrieUlTaPMJnIAcLykXYEbgB3q7qcAWwPXAg8BuwDYvkvSF4EL63772R4YGPRDlJlOFqXMPpIZSCIiIuIp6sWOw4CrbX9zmH2eAdxm25I2oFwEurOPMSMiIuZ5jRUwbO80zKbNhtjXwIeHOc7hwOFDrJ8KrDOejBERETFP2Bh4J3C5pEvqus8CqwHY/j6wPfBBSY8B/wF2rOcnERER0SdtGwMjIiIioq9snwtolH2+B3yvP4kiIiJiKG0bAyMiIiIiIiIi4ilSwIiIiIiIiIiI1ksXkoiIPrt6/zObjvAUz99706YjRERERESMKC0wIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi9jYERERMRcRdKbgFcABs61/YuGI0VERMQESAuMiIiImGtIOhj4AHA5cAXwfkkHNZsqIiIiJkJaYERERMTcZFPg+bYNIOlI4MpmI0VERMRESAuMiIiImJtcC6zWc3/Vui4iIiI6Li0wIiIiYm6yBHC1pAvq/ZcCUyVNAbC9bWPJIiIiYlxmu4BRB8galu2TZvfYEREREbPp800HiIiIiDljPC0wtqn/Ph14OXBmvf9q4M9AChgRERHRV7b/ACBpSXrOc2zf1VioiIiImBCzXcCwvQuApN8Ca9u+pd5fCThiQtJFREREzAJJuwH7AQ8DTwCiTKf67CZzRURExPhNxBgYqw4UL6rbmHnwrIiIiIh++RSwju07mg4SERERE2siChi/l3Q6cEy9/1bgdxNw3IiIiIhZ9U/goaZDRERExMQbdwHD9u51QM//qasOtf2L8R43IiIiYjbsBfxZ0vnAIwMrbX+kuUgRERExESZkGtU648iEDNop6X+B91L6q14O7AKsBBwL
LAdcBLzT9n8lLQwcBawP3Am81fb19Th7AbsCjwMfsX36ROSLiIiIVvsBZWDxyyljYERERMRcYr7ZfaCkc+u/90u6r+d2v6T7ZvOYKwMfASbbXgeYH9gR+ArwLdtrAHdTChPUf++u679V90PS2vVxLwC2BA6WNP/s/qwRERHRGQva/rjtH9s+cuA20gMkrSrpLElXSbpS0keH2EeSviPpWkmXSXrJnPsRIiIiYiizXcCw/Yr67xK2l+y5LWF7yYH9JC0zi4deAFhU0gLA04BbgE2BE+v2I4E31OXt6n3q9s0kqa4/1vYjtv8FXAtsMMs/ZERERHTNqZJ2k7SSpGUHbqM85jHgE7bXBjYCPlwvhvTaCliz3nYDDpnw5BERETGi2S5gzILfj3VH2zcDXwdupBQu7qV0GbnH9mN1t2nAynV5ZeCm+tjH6v7L9a4f4jEzqSc5UyVNnT59+lijRkRERDvtRB0Hg3IOcREwdaQH2L7F9sV1+X7gap563rAdcJSL84Cl69TxERER0ScTMgbGKDTmHUtrje2A1YF7gBMoXUDmGNuHAocCTJ482XPyuSIiImLOsr36eB4vaRKwHnD+oE3DXRzpnUoeSbtRWmiw2mpPnVV+/U8dNZ54c8RFX3vXmPa7cb8XzuEks261z18+6j4bf3fjPiSZNX/a409NR5ijvveJXzUd4Sl2/8Y2o+6z/zu270OSWbP3T08cfSfg6v3PnMNJZt3z99501H323XffOR9kFo0l0/EntLNx/w5vuaDpCHNcPwoYs1IUeA3wL9vTASSdBGxMucqxQG1lsQpwc93/ZmBVYFrtcrIUZTDPgfUDeh8TERHRCeue2L7xpy/dfoumI4xK0jrA2sAiA+tsj1o5kLQ48HPgY7ZnazyvXBiJiIiYc/rRhWRW3AhsJOlpdSyLzYCrgLOAgZLozsDJdXlKvU/dfqZt1/U7SlpY0uqU/qpzfzkqIiJiHidpH+C79fZq4KvAtmN43IKU4sXRdXa1wXJxJCIiomH9KGCMuQuJ7fMpg3FeTJn+bD7KVYzPAB+XdC1ljIvD6kMOA5ar6z8O7FmPcyVwPKX4cRrwYduPT8hPExEREW22PeUCyK22dwHWpbTQHFa9aHIYcLXtbw6z2xTgXXU2ko2Ae23fMsy+ERERMQdMSBcSSa8A1rT9Y0krAIvX2T+gnESMme19gH0Grb6OIWYRsf0w8JZhjrM/sP+sPHdERER03n9sPyHpMUlLArczc8uJoWwMvBO4XNIldd1ngdUAbH8fOAXYmjKz2UPALnMge0RERIxg3AWM2lRzMrAW8GNgQeCnlJMBbN813ueIiIiIGKOpkpYGfkiZgeQB4C8jPcD2uYzSYrR2Uf3wBGWMiIiI2TARLTDeSBmte2D6sX9LWmICjhsRERExS2x/qC5+X9JpwJK2L2syU0REREyMiRgD47/1qoQBJC02AceMiIiImGWSfj+wbPt625f1rouIiIjumogWGMdL+gFlqtP3Ae+hNNuMiIiI6AtJiwBPA5aXtAwzuoQsCazcWLCIiIiYMOMuYNj+uqTXAvdRxsH4vO0zxp0s5kkbf3fjpiM8xZ/2+FPTESIiYnTvBz4GPJMy9sVAAeM+4HsNZYqIiIgJNCGzkNSCRYoWERER0QjbBwIHStrD9nebzhMRERETb7bHwJB0v6T7ev69r/f+RIaMiIiIGKNbBwYTl/Q5SSdJeknToSIiImL8ZruAYXsJ20v2/Ltk7/2JDBkRERExRv/P9v2SXgG8BjgMOKThTBERETEBxj0LiaSNeqdNlbSEpA3He9yIiIiI2fB4/fd1wKG2fwMs1GCeiIiImCATMY3qIcADPfcfJFc6IiIiohk319nR3gqcImlhJuZ8JyIiIho2ER/osu2BO7afYIIGB42IiIiYRTsApwNb2L4HWBb4VKOJIiIiYkJMRAHjOkkfkbRgvX0UuG4CjhsRERExS2w/BNwOvKKuegz4R3OJIiIiYqJMRAHj
A8DLgZuBacCGwG4TcNyIiIiIWSJpH+AzwF511YLAT5tLFBERERNl3F09bN8O7DgBWSIiIiLG643AesDFALb/3TvYeERERHTXbBcwJH3a9lclfRfw4O22PzKuZBERERGz7r+2LckAkhZrOlBERERMjPG0wLi6/jt1IoJERERETIDj6ywkS0t6H/Ae4IcNZ4qIiIgJMNsFDNu/qosP2T6hd5ukt4wrVURERMRssP11Sa8F7gPWAj5v+4yGY0VERMQEmIjpTvcCThjDuoiIiIg5zvYZks6nnudIWtb2XQ3HioiIiHEazxgYWwFbAytL+k7PpiUpU5bN7nGXBn4ErEMZW+M9wDXAccAk4HpgB9t3SxJwYM3xEPBu2xfX4+wMfK4e9ku2j5zdTBEREdENkt4PfAF4GHgCEOV84tlN5oqIiIjxG880qv+mjH/xMHBRz20KsMU4jnsgcJrt5wHrUsba2BP4ve01gd/X+wBbAWvW227AIVCutAD7UKZ03QDYR9Iy48gUERER3fBJYB3bk2w/2/bqtkctXkg6XNLtkq4YZvsmku6VdEm9fX7Ck0dERMSIxjMGxqX1Q36LiWrdIGkp4JXAu+tz/Bf4r6TtgE3qbkcCZ1PmeN8OOMq2gfMkLS1ppbrvGQPNRSWdAWwJHDMROSMiIqK1/klplTmrjgC+Bxw1wj5/tP362QkVERER4zeuMTBsPy5pVUkL1WLDeK0OTAd+LGldSouOjwIr2r6l7nMrsGJdXhm4qefx0+q64dY/haTdKK03WG211SbgR4iIiIgG7QX8uY6B8cjAytGmd7d9jqRJczhbREREjMNEDOL5L+BPkqYADw6stP3N2czzEmAP2+dLOpAZ3UUGjvvk3O4TwfahwKEAkydPnrDjRkRERCN+AJwJXE4ZA2MivUzSpZRutJ+0feXgHXJhJCIiYs6ZiALGP+ttPmCJcR5rGjDN9vn1/omUAsZtklayfUvtInJ73X4zsGrP41ep625mRpeTgfVnjzNbREREtN+Ctj8+B457MfAs2w9I2hr4JWUMrpnkwkhERMScM+4Chu0vTESQeqxbJd0kaS3b1wCbAVfV287AAfXfk+tDpgC7SzqWMmDnvbXIcTrw5Z6BOzenNCmNiIiIuduptRXEr5i5C8m4plG1fV/P8imSDpa0vO07xnPciIiIGLtxFzAkrQB8GngBsMjAetubzuYh9wCOlrQQcB2wC6V1x/GSdgVuAHao+55CmUL1WsqAXbvU575L0heBC+t++2X+94iIiHnCTvXf3gsX455GVdIzgNtqV9YNKOcmd47nmBERETFrJqILydHAccDrgQ9QWkhMn92D2b4EmDzEps2G2NfAh4c5zuHA4bObIyIiIrrH9uqz8zhJx1C6ny4vaRplOvYF6zG/D2wPfFDSY8B/gB3reUhERET0yUQUMJazfZikj9r+A/AHSReO+qiIiIiICSJpU9tnSnrTUNttnzTS423vNMr271GmWY2IiIiGTEQB49H67y2SXkcZmXvZCThuRERExFi9ijL7yDZDbDMwYgEjIiIi2m8iChhfkrQU8Angu8CSwP9OwHEjIiIixsT2PnVxP9v/6t0maba6lURERES7zDfeA9j+te17bV9h+9W217c9ZWC7pMz+EREREf3y8yHWndj3FBERETHhJqIFxmjeAvxfH54nIiIi5lGSnkeZEW2pQeNgLEnPLGkRERHRXf0oYKgPzxERERHztrUoM6ItzczjYNwPvK+JQBERETGx+lHAyBRjERERMUfZPhk4WdLLbP9luP0k7WU7LUMjIiI6aNxjYIxBWmBEREREX4xUvKje0pcgERERMeHGXcCQ9JQpUweN9n3CeJ8jIiIiYoLkwkpERERHTUQLjF9JWnLgjqS1gV8N3Lf95Ql4joiIiIiJkK6tERERHTURBYwvU4oYi0tan9Li4h0TcNyIiIiIiZYWGBERER017kE8bf9G0oLAb4ElgDfa/vu4k0VERETMIknL
2r5r0LrVbf+r3k3X1oiIiI6a7QKGpO8yczPMpYB/ArtLwvZHxhsuIiIiYhb9StJWtu+DJ7u2Hg+sA+naGhER0WXjaYExddD9i8YTJCIiImICDHRtfR2wFnAU8PZmI0VERMREmO0Chu0jASQtBjxs+/F6f35g4YmJFxERETF26doaEREx9xr3GBjA74HXAA/U+4tSThpePgHHjoiIiBhVurZGRETM/SaigLGI7YHiBbYfkPS0CThuRERExFila2tERMRcbiIKGA9KeontiwHqVKr/mYDjRkRERIxJurZGRETM/eabgGN8DDhB0h8lnQscB+w+ngNKml/SXyX9ut5fXdL5kq6VdJykher6hev9a+v2ST3H2Kuuv0bSFuPJExEREZ3xe0p31gGLAr9rKEtERERMoHEXMGxfCDwP+CDwAeD5tsfbbPOjwNU9978CfMv2GsDdwK51/a7A3XX9t+p+A1Om7Qi8ANgSOLhegYmIiIi521O6tgKjdm2VdLik2yVdMcx2SfpOvThymaSXTGDmiIiIGIPZLmBI2rT++yZgG+C59bZNXTe7x10FeB3wo3pfwKbAiXWXI4E31OXt6n3q9s3q/tsBx9p+xPa/gGuBDWY3U0RERHTGg73FhVno2noE5aLHcLYC1qy33YBDxpExIiIiZsN4xsB4FXAmpXgxmIGTZvO43wY+TZn6DGA54B7bj9X704CV6/LKwE0Ath+TdG/df2XgvJ5j9j5mJpJ2o5yIsNpqq81m5IiIiGiJj1G6tv4bEPAM4K2jPcj2Ob1dUYewHXCUbQPnSVpa0kq2b5mAzBERETEGs13AsL1P/XeXiQoj6fXA7bYvkrTJRB13JLYPBQ4FmDx5skfZPSIiIlrM9oWSngesVVddY/vRCTj0kxdNqoGLIzMVMHJhJCIiYs6Z7QKGpI+PtN32N2fjsBsD20raGlgEWBI4EFha0gK1FcYqwM11/5uBVYFpkhagzPl+Z8/6Ab2PiYiIiLmMpE1tnzlEN9bnSsL27LYMnSW5MBIRETHnjGcQzyVGuC0+Owe0vZftVWxPogzCeabttwNnAdvX3XYGTq7LU+p96vYza9POKcCOdZaS1Sn9VS+YnUwRERHRCa+q/24zxO31E3D8XByJiIho2Hi6kHwBQNKRwEdt31PvLwN8Y0LSzfAZ4FhJXwL+ChxW1x8G/ETStcBdlKIHtq+UdDxwFfAY8OGB+eAjIiJi7jMnurYOMgXYXdKxwIbAvRn/IiIior/GM4jngBcNFC8AbN8tab3xHtT22cDZdfk6hphFxPbDwFuGefz+wP7jzRERERHtN96urZKOATYBlpc0DdgHWLA+9vvAKcDWlJnNHgLmVKEkIiIihjERBYz5JC1j+24ASctO0HEjIiIixmqJEbaNOhaF7Z1G2W7gw7MaKiIiIibORBQavgH8RdIJ9f5bSMuHiIiI6KM+d22NiIiIBoy7gGH7KElTgU3rqjfZvmq8x42IiIiYDXOka2tEREQ0b0K6etSCRYoWERER0bR0bY2IiJhL5QM9IiIi5ibp2hoRETGXSgEjIiIi5hrp2hoRETH3SgEjIiIi5irp2hoRETF3mq/pABERERERERERo0kBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9VhUwJK0q6SxJV0m6UtJH6/plJZ0h6R/132Xqekn6jqRrJV0m6SU9x9q57v8PSTs39TNFRERERERExPi1qoABPAZ8wvbawEbAhyWtDewJ/N72msDv632ArYA162034BAoBQ9gH2BDYANgn4GiR0RERMRgkraUdE29KLLnENvfLWm6pEvq7b1N5IyIiJiXtaqAYfsW2xfX5fuBq4GVge2AI+tuRwJvqMvbAUe5OA9YWtJKwBbA
Gbbvsn03cAawZf9+koiIiOgKSfMDB1EujKwN7FQvoAx2nO0X19uP+hoyIiIi2lXA6CVpErAecD6wou1b6qZbgRXr8srATT0Pm1bXDbc+IiIiYrANgGttX2f7v8CxlIskERER0SKtLGBIWhz4OfAx2/f1brNtwBP4XLtJmipp6vTp0yfqsBEREdEdY73w8eY65taJklYd6kA5r4iIiJhzWlfAkLQgpXhxtO2T6urbatcQ6r+31/U3A70nEKvUdcOtfwrbh9qebHvyCiusMHE/SERERMxNfgVMsv0iStfUI4faKecVERERc06rChiSBBwGXG37mz2bpgADM4nsDJzcs/5ddTaSjYB7a1eT04HNJS1TB+/cvK6LiIiIGGzUCx+277T9SL37I2D9PmWLiIiIaoGmAwyyMfBO4HJJl9R1nwUOAI6XtCtwA7BD3XYKsDVwLfAQsAuA7bskfRG4sO63n+27+vITRERERNdcCKwpaXVK4WJH4G29O0haqWc8rm0pA41HREREH7WqgGH7XEDDbN5siP0NfHiYYx0OHD5x6SIiImJuZPsxSbtTWmvODxxu+0pJ+wFTbU8BPiJpW8qU73cB724scERExDyqVQWMiIiIiCbYPoXSsrN33ed7lvcC9up3roiIiJihVWNgREREREREREQMJQWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9ubqAIWlLSddIulbSnk3niYiIiHYa7ZxB0sKSjqvbz5c0qYGYERER87S5toAhaX7gIGArYG1gJ0lrN5sqIiIi2maM5wy7AnfbXgP4FvCV/qaMiIiIubaAAWwAXGv7Otv/BY4Ftms4U0RERLTPWM4ZtgOOrMsnAptJUh8zRkREzPNku+kMc4Sk7YEtbb+33n8nsKHt3QfttxuwW727FnDNHIq0PHDHHDr2nNTV3NDd7F3NDd3N3tXc0N3syd1/czL7s2yvMIeOPceN5ZxB0hV1n2n1/j/rPncMOla/ziugu+/H5O6/rmbvam7obvau5obuZk/uoQ15brHAHHzCTrB9KHDonH4eSVNtT57TzzPRupobupu9q7mhu9m7mhu6mz25+6/L2bukX+cV0N3/0+Tuv65m72pu6G72ruaG7mZP7lkzN3chuRlYtef+KnVdRERERK+xnDM8uY+kBYClgDv7ki4iIiKAubuAcSGwpqTVJS0E7AhMaThTREREtM9YzhmmADvX5e2BMz239sONiIhoqbm2C4ntxyTtDpwOzA8cbvvKBiP1pTnpHNDV3NDd7F3NDd3N3tXc0N3syd1/Xc4+Rw13ziBpP2Cq7SnAYcBPJF0L3EUpcjStq/+nyd1/Xc3e1dzQ3exdzQ3dzZ7cs2CuHcQzIiIiIiIiIuYec3MXkoiIiIiIiIiYS6SAERERERERERGtlwJGRERERERERLReChhzmKRlJS3bdI6ImKFOgTiwvLikyfk9jbHI+yQiIiKiOSlgzAGSVpN0rKTpwPnABZJur+smNRwvYp4m6d3AbZL+Lmkr4DLgK8ClknZqNNwskLSGpDdLWrvpLOMhafGmMwxH0saSrpZ0paQNJZ0BXCjpJkkvazrfSCS9p2d5FUm/l3SPpD9Lem6T2SIiIiJmVwoYc8ZxwC+AZ9he0/YawErAL4Fjmww2VpJWkLSepBe1+QvGYF3MLWk+SfPV5YUkvaSLV3k79IX6E8BawBaU39XX2t4MmAzs1WSwkUg6S9LydfmdwCnAVsBxkvZoNNz4XNV0gBF8C9gBeC/wG+ALtp8DbAd8vclg
Y7B7z/I3Ke/1ZYGvAYc0kigmRG0x9kZJ20p6XtN5RlNbon5e0ntV7C3p15K+JmmZpvPNKklnNp1hLOrFtKXr8iRJ20tap+FYI6rv62Xr8gqSjpJ0uaTjJK3SdL7RSHq1pO9JOlnSSZIOkLRG07lmhaTVJb2pI39bdu85L1pD0jm1UH++pBc2nW8kkraQtOvgC9u9Fx/aqF7MWbIuLyrpC5J+JekrkpbqV44UMOaM5W0fZ/vxgRW2H7d9LLBcg7lGJWltSb8D/kJpPfJD4HJJR/TzjTmrOpz7DcAtwM2StgP+SPmCcZmkbZrMNpoOf6F+3PYdtv8FPGD7nwC2b2s412hWsH1HXf4I8DLb7wU2BN7XXKzRSfr4MLdPAG0uNC5o+3LbfwGm2z4XwPbFwKLNRpslz7V9qO0nbP+CUsiIjpH0KklTgQOAw4HdgMMknS1p1WbTjeinwGLA+sBZwDMord7+AxzRXKzRSbps0O1yYOOB+03nG46kPYE/AOdJei9wGjM+nz/eaLiR7W/7rrr8PeCvlNynAj9uLNUYSPo/4F3AecCjwD/r7QRJb2ky20gk/bJneTvgTGAb4GSVFqtt9sGe86IDgW/ZXhr4DPD9xlKNQtKXgb2BFwK/H3TOvPvQj2qNw4GH6vKBwFKUv+cP0cff0QVG3yVmw0WSDgaOBG6q61YFdqb8MW6zw4GdbV8jaQPgw7Y3lPQ+4DBg+2bjDaurufcB1qV8GboUeGn9GZ4F/Bz4VZPhRjHUF+o7JT2N8gH+3eaijejGeqKxBPA3Sd8ATgJeQykmtdWjkla2fTPwAPBgXf8IMH9zscbky5TC3GNDbGtzIb032+DWOQv1M8hsWEXSdwABK0ha0PajdduCDeaK2fdtYHPb0yWtDnzT9saSXkv5nNu80XTDe6btrSUJmGZ7k7r+j5IuaS7WmFwP3Ad8iVJwEeVCQ6svMADvBNYGnkb5GZ5d3zeLUS7yfLPBbCPp/Sxbw/Zb6/IRkj7WQJ5Z8XrbLwSQdCzwB9ufknQi5T1zQqPphvesnuXPAJva/le9QPV72l1k7P0e+/RaoMf22ZKWaCjTWGwDrGf7MUn7Aj+T9Gzb/0v5G9Nm89keOJebbPsldfncfv49b/OJY5e9C7gc+AJwer3tC1xB+VBps0VtXwNg+wJKdRDbPwRe0GSwUXQ1N7Zvra0Bbuz5GW6g/b+fj0pauS536Qv1OygnpNOAbSmtdvYCVgTe3VysUf0v8FtJ+wFXAmdK2odyZa3VV6aAi4Ff2v7C4Btwf9PhRvD/akEO278cWCnpOcBRTYUao08BFwFTgc9SW7pIegYwpcFcMfvmtz29Lt9I/eJh+wxg5WEf1bz5aleRVYHFB5pMS1qOlhcCbW9LuZhwKLCu7euBR23fUD+n2+px2/8B7qEUXu4EsP3gSA9qgbMl7Sdp0br8RihdM4B7m402qic0o/vvM6nnQbbvpt1fSt2zvEA9H6VeoHqimUhjdmJtaf1s4BeSPibpWZJ2ofyNbKsFBooAtu+hFDSWlHQCLf+bCFxRX18oY8dNBlAZW+vR4R82sWR79L1iniHpJEorkTOBNwHL2H6PpAWBK2yv1WjAYXQ491+B9W0/IWmDWnxB0vzApbZb219V0ibAQZSTu2WBl1CKda8ATrfd9jECOqd2h3ob8FzKlYdpwMm2/9ZosFFIWgu4q+fLV++2FTvQfSeicZIOp3zZOJNSfL3Z9sdrke1i263ss64yOPK3690PAR+k/BxrU8aVObShaGNWWy58EXgO5TO71eMxSDqC8kVoMUrT7scoxe5NgSVs79BcuuHVc7a9gYFxAFahXBz5FbCn7dZ+KZX0VuCrwN8p42x90PZvJK0AHGj7bY0GHIakxymvsYCFgWfZvkXSQsBU2y9qNOAoajeXD1J+NxemtHz/JfAV260sekn6NfA1238YtP5LwGdtt/YCZj0PPRD4H+AOyrn/TfX2EduX9iVHChj9JenztvdrOsdw
VAZ8+izlxOJS4ADb99c37PNtn9dkvuF0OPdLgcttPzxo/STgFbZ/2kiwMeriF2qVAVN3Bt5MuSL4OOWE45DBHyYxb6uFxPdSTqJPs/2nnm2fs/2lxsLNBkl/t50ZSDqqfrl7HzM+5w63/Xi9Wv30NrcIqL9Lqk2mFwBeTCnAtLnb3lNIWpfSXbK1/evhyanC30IpFJ0IbED5rL4ROKgDLTEGzi8WsH1n01nGqrbAeDZwbb2y3ln1vPr5dQyomED1bza1ldTgbQNdhVtNZSDP1ann/v2+EJUCRp9JutH2ak3nmBdIerrt25vOEe0i6cfADcDvKGOj3Efpn/oZSvGlrWN3DEvSobZ3azrHcLpaCJD0I0of8gso3f/+YPvjddvFPX0/W0fS/ZQvL71Nl59GuRpr20s2EizmWbWp8ZNF4zYXukci6cu2P9t0jnlF117vLr/Pa1evx23f13SWsZC0LfDbwRcB2662bnnU9Ut47SL1EuAq26c2Gm42SFrWMwbf7c9zpoAx8SQN94svylgNrR08tVa896JME7gi5QT4duBkSquGe5pLNzw9ddpRUfp/r0d5n/f1F2usagVzL8oXu1Nt/6xn28G2P9RYuFHUbjsnUcY2eKDpPGMl6bLeJpGSzrO9kaSFgUtsP7/BeMMa4j3+5CZKd6PWNmnuaiGg971Sr2geDCwP7AScZ3u9JvONpA7guTTwqYErI5L+ZXv1RoPFbOvq54WkVwHfoIzHsD7wJ2AZSn/pd9q+afhHN6v+Hg32LuoYOLY/0t9EY9Ph98rg11uUz4xWv97Q3fe5pGdSZjbajjJW0sDV/8Mps8L0bVyDWSXpP5TuL6cCx1C6Lz8+8qOaJ+lSYBPbd0v6FPBGykx+r6J02xk8aHhr9F50krQ2pbvOgpTf1bfaPr8fOVrbx6bj7gHWtL3koNsStHuWA4DjgbuBV9te1vZywKvruuMbTTayOygFi4HbVMqgZhfX5bb6MeWX/ufAjpJ+Xr9IA2zUXKwx2RB4A2VWj+NV5m9v++BDUAYffQ6ApJcA/wWw/QgzD2bVNtMp7+XB7/OpwNMbzDUWG9h+m+1vU943i0s6qb7X2zy42ZPvZ9uP1VYul1DGIGjz9K8DJ/oHAsdI+kjtOtXm93eMrqufF98GtrL9GspVxkdtbwzsT5k9pc3eSBnjqfdv76M9y23V1ffK4Nd7Kt14vaG77/OfUrqjLUXpdvRz4PmUrgEHNRlsDP4GrAmcA3wC+Lek79diUpvNXwd3BXgrsFktCmwFvK65WGPypp7lrwEfrRdGdgC+1a8QKWDMGUcx87REvX42zPq2mGT7K7ZvHVjhMkvGVxj+Z2qDTwHXANvaXr3+Mk2ry89uONtInmN7T9u/dBnt/GLK7BLLNR1sDG63vT0wiTLA1vuAmyX9WFJbp/OD8l45S9I/KB/UnwKoA239uslgo7iOUrFfvef27Ppeb/sgmF0tBEyVtGXvijqG0Y8p7/tWs30RZXpggD8AizQYJ8avq58XXZ09Bcp4I3cAWwJn2D4SuN/2kXW5rbr6Xunq6w3dfZ8vZ/tsANsnAa+0/aDtzwGvbDTZ6Gz7bts/tL0ZsC5wFXCApFa2eKnukzQwSP8dzPhsXoBufTd/5kCXF5dJCBbt1xO3titDl9Vf+uG2faafWWbDDZI+DRzZ0+x4YHrJ1v4xsP0NSccB36p/tPahG1cbF5Y0n+0nAGzvL+lmSjW5zV/soL6+ta/kT4Cf1JOjtwB7Ar9tMNuwbJ8p6VmUD+07etZPBz7dXLJRfZvSHHWoUdi/2t8os2yqpC1tnzawwvZ+kv4NHNJgrhHZfscw638E/KjPcWZL/dvyHZXp2Vrb5SXGpKufF1MlHcaM2VPOBqizp7R5ym1s3w98TNL6wNGSfkM3vmB08r3S4dcbuvs+ny7pHcBZlKvr1wNIEu1/7WdqwVkvvn6H8pnX5ouuH6C8vy+ldNOfKukc4IXAlxtNNrpnS5pCee1XkfQ0
2w/VbQv2K0TGwJiDJC04uO+YpOV7vzS1TR3AZ09mjIEBcCswhTIlUSvHkuhVB/X5LKU1yTOazjMSSV+lDED0u0HrtwS+a3vNZpKNTtI5tttenX8KSatRWo88XD+g300dPAn4oevc3BHwZF/yFWz/c9D6F9m+rKFYY9Ll7PFUXf28UIdnT+lVPy8+RJmFZMjiZlt09b3Sq0uvN3T3fV7Pib5OyX0JZdykW+oFqU1s/7zJfCORtMlA65GuURncfHNmnsXvdLd0rMEBQ3TPucj2A/Vi9/a2+9LtKAWMOaCOJvsTSpOgi4HdbF9ft7V2wLq5Sf3AeI7tK5rOEu0i6QrKmAwPSfoKZe7wXwKbAth+zwgPbyVJr63NVFuri1+mJe1AaflyO+XKwrttX1i3tfpveZezR0TMDSQt5w5NAxvRFW1vGtRVXwW2sL08cChwhqSBQZPaPGAdAJKeLemTkg6U9E1JH6hfPlqrDlK36sB92//pQvFC0oYDr62kRSV9QdKvJH1FZUaYTpL02qYzjGC+nuZurwF2sP3TWrhYv8Fc49HmAcIGvkz/Dfi5pCslvbRn8xHNpBqTzwLr234xsAulm9Qb67a2/y3vcvYYgqRlJX1e0ntV7C3p15K+VltPdo6kzk0ZOEDS5U1nGEkXz+VG0oHX+wBJy9flyZKuA86XdMMQV61bQ9ICkt4v6VRJl9XbqfX90rcuAbND0qqSjpX0R0mf7c0r6ZcNRpttbf+bKGn3nvf5GpLOkXSPpPM1Y1yPOS5jYMwZC9m+EsD2iZKuBk6S9BlaPi6DpI8A21AGfHsp8FfKfNbnSfpQi5tqfRHYU9I/KVMpndAzmFKbHU4ZdAjKjAEPAV8BNqMMFPimYR7XdocBqzUdYhg3SdrU9pmUvp6rUsZ+afXgZip9DofcBLQ6OzO+TN8iaQPKl+m9bP+Cdn+Znt/2LVAGqKqt635di6Wt/ltOt7PH0H4KXE4ptL6jLn8FeC2lELhdY8lGoDLb05CbgBf3McoskzTcZ7CA1nZRlfRR4PV07Fyuq6939Trbe9blr1GmlLxQ0nMpA/hPbi7aiH5CmT3xC5RuDFCm392Z8jfnrc3EGpPDKYOxnwfsCvxB0ja11Utrx8Do8t9E4IO2v1eXDwS+ZfsXkjYBfgBs3I8QKWDMGY9KekYdTAbbV0rajDLDwXOajTaq9wEvrv32vgmcYnsTST8ATqa9g8BdRzmpew3lj+0XJF1EKWacVAeGaqP5esZcmNzTrPtcSZc0lGlMOvyF+r3AUZL2Be4FLqmv9dLAx5uLNar/oXxpeWDQegEb9D/OLOnql+n7JT1noNtLLcC8GvgF8IJmo42qy9ljaM+0vbUkUWbZ2qSu/2PLPy8upHyRHqpYuXR/o8yy44CjGfrvVJtn9Xkv3TyX6+rrDbCApAXqOd2iA132bP9dM6awbaP1bT930LpplGLX35sINAtWsP39uryHymCk56iMhdfmc4su/03srR08vV6IwvbZkpZoIkRMnD0pA2D2TkU6rVanPtxQplmxAPA4sDB1tGrbN7a8KZnraNu/BX5bs24F7EQZnGiFJsON4ApJu9j+MXCppMm2p9aK/aOjPbhhnfxCbfsm4NWSnk8ZPOkIyof1hQMjtrfUecBDtv8weIOkaxrIMyu6+mX6gzx1lPP7VAbC26GZSGPW5ewxtPlqV5ElgMUlTbJ9fW09ttAoj23S1cD7bf9j8Aa1e6pDgMuArw/VJVXSa4bYv026eC7X5df7YOAUSQcAp0k6EDiJMr7WJU0GG8Vdkt4C/HzgHEjSfJQZ5e5uNNnoFpS0iO2HAWz/VNKtwOnAYs1GG1GX/yaeKOkIYD/gF5I+RjmX25ShZ8mbI1LAmAMGj/oMMw3ks38DkWbFj4ALJZ1P+YL6FQBJKwBtnoFk8In6o5SZU6aoTGHVVu8FDpT0Ocpc0H+pf7xuqtvarMtfqLF9NeVDBEnbtrx4ge2tRtjW9tlgOvll2valvfdr
//E1getsH91MqrHpcvYY1v9RxpIBeA/wI0mmzB7whcZSjW5fhh9zbY8+5pgdHwPuG2bbG4dZ3wZdPZf7GN18vbH9XZVBwj/AjJklnksZJPxLDUYbzY6U98fBkgYKFktTplXdsalQY/QjYENKawagfAerBZk2Ty+/Lx39m2h7b0nvprRwfw6lQLob5X3+9n7lyCwkc0Ctvn7d9h2SJgPHA09QRoJ/11Bf+NpE0guA5wNX2P7baPu3gaTn2m57U7dh1S8Xq1OnUrJ9W8OR5lrD9LE9mDJdG7ZP6m+iec+gL9OtvcIj6afAx+rf8i2AHwJ/p2T/pO0TGg04gi5nj+GpTL0n249JWoDSX/rmgS5aEQO6eC4XzRoYCywzp0TbpYAxB0i63PYL6/JZwKd7B/Kx3daBfIAnm45h+wlJCwHrANfbbnPV/inqQFUHN51jNCpzcN9n+x5JkygDPf2tC7OoDCZp2ba/TyQ9SmleeDszWgVsD5xI6YrUxWlUn/yb00Zd/TI96G/5n4G31Sb7ywO/t73uyEdoTpezx9DU4imHRyJpQ+Dq2upqUUo325cAVwFftn1vowFHoDJzzx9s31VbL3yDMn7EVcAnbE8b8QANmlvO5QZI+rzt/ZrOMRxJywK7AzdTBpfcC3g5paXnl1terO/cNOcAkpa3fUfP/XdQujBfAfzQLf6SK+nZlIH6V6V09fo75TvicC2QWknS6tS/if0slGYa1TljgXplBAYN5ENpatNakt4A3ALcLGk74I+U0ZQvk7RNk9lGIunjg26fAPYbuN90vuFI2pPS9O08Se8FTqOM3XFcm3MDSNpY0tUq02JuKOkMSpPVmyS9rOl8I3g5sChlzItdbO8C3FGXW1u8kPSmYW5vpv2js6/bc5KxD/BK26+hDLz7ueZijWo+zZh28Alq/876s7S9C2aXs8fQ/irpH5K+KGntpsPMgsMpM2xBGbV+KUqT9Ycos2212f49X/i/R5nNYyvgVFqcvavncqNoe7fan1LGXZhM6X6xEuV9/h9aPF24ujvNOZRx7wCoXbHfCVxEmZnpm02FGo3KjI8/oAxM+1LKd8OBWYI2aS7Z6NQzPW3923ImZfbKKbVrSV/kJGbO6OpAPlC+XKxL+YJ3KfBS29dIehZlqqJfNRluBF8ATgGuZMZV9fkpg5212Tsp/ZefRpnS89m2p0taDDifFv8BBr5FGb9gceA3wBtsn6syPdR36dNUSrOqtoZ6LWXE6rOA1k9vXHV5dPb5JC1ZryzM9GW6p9jbRl8AzpJ0EPAn4ASV2XdeTSk2tlmXs8fQLqN8ZuxEOVl8kNIP+Vjb1zcZbBSdnW2Lch4xYA3bA1NKHqEyeF1bdfJcTtJwV59F+VnarKuzBHV1mnOYOd+bgP+x/aCknwEXN5RpLLo64yPMPD3tZ4BNbf9roHUnfSp6tfnEsbPqQD6XUwauW5My9sWalDdlmwfyAcB1+ldJN9q+pq67YaA5Yku9gNK0czHgC7YfkrSz7TYPbAbwuO3/SPovpUp/J0D9A9xsstEtaPtyAEnTbZ8LYPvi2ky4teqAnQdKOgH4dsNxxqrLo7N38su07eMl/ZVy5W9gULaNgGNsn95ouFF0OXsMy/X3f29g7/plY0dKIeBG2y9vNt6wujzb1tmS9qMMoHq2pDfa/oXKLEqt7foCnT2Xu4dSbHnKOGBq/+wMXZ0lqKvTnAMsKmk9So+C+W0/CGUgf0mPNxttVF2cJQhmfk8sYPtf8OQFqb4Nhp8CxpxzIzAVuI3yBr2GcuLY9g9rJM1Xv+C9p2fd/LT4D7DtG4G31OZMZ0j6VtOZxujiWilejFK5PFLSaZTWOlc1mmx0vSdBew3a1tr3Si/b/6bFs2AM8jG6Ozp7Z79Mu0xz9pmmc8yOLmePIQ2eyecC4ILaZbLNMxF1ebat3SkFo4GZtf63tnz5FaU1TGt18VwOOIpyhXeogcx/1ucss2qoWYKgDKTa5otpXZ3mHEo3qYGW
yndJWqnmXw54bITHNa2rswQBrFtbSglYuOc1X4iZW6zNURnEcw6Q9FHgdcA5wNaUPpP3UL5kfMj22Y2FG0Xt+3a565zKPesnAa+w/dNGgs2C2v1iX2BDt3x6ydp8/i2UiuaJlOmgdqIUwA4aqCa3kaRtgd/ZfmjQ+ucAb7bdyimsJD2D0rz2CeDzlCmr3kQ58fioM5p/VCpTMO9O+f38LvBW4M2U98p+th9oMN6Iupw9hibpbbbb/iVuWOr4bFuSlqJccWz9DA1zw7lcF6mDswRJWhd4qBa8e9cvCOzgDk67Xf8fFh58ftommstmCZK0NPB823/py/OlgDHxaveRgb5NT2NG36bVgJNtt7lvUzRM0nJdOEHqqtrC5TeUVi9vo4wr8TPgDcBrbG/XXLrhDfGFdEdmFF5a/YW0q1+mJR1PuUq8KLAWZTT544BtgWfYbu0V2C5nj7mfOjBjFUC9qvjowGwG9cr0Sygj7p/aaLhZJOkltts8LsCIJD2v7V/0apFrS2Dluupm4HTb9zQWajZI2tb2lKZzjJWkyfTM5tH29wl0fwZCSSvS8z7vd0E6BYw5oBYwJtt+pPaHO8N16lRJV9hep9mEw6tXSPYCVgFO7b3aI+lg2x9qLNwIOpz7AMq4BnfUP8DHU/4ALwS8y/YfGg04gpr3a5QP6L0oI81vQJkK6n22L2ku3fAk/XWgiFj7Bq/Ws+0S2y9uLNwIuvyFtKvZB94PKu2AbwFWsu16/1LbL2o44rC6nD2GJul5lMGTnwA+Avw/SuH178DOtq9uLt3wJH3O9pfq8trALyljgwl4q+3zG4w3IkmXApvYvlvSpygtaU8BXgVMtT24+2QrqAymPdMqyjhs21DO/TtXyBj8ed02kt5Fad35W8p5EZRz0tdSxmY7qqlsI5H0psGrgIOADwHYPqnvocZI0qso49/dQ5nV7E/AMpSxdd5pu5XjpqjMQPh+4BHg68AnKdk3Ag6z3doB/OuYI4dQZpPqfZ/fA3zQ9l/7kSNjYMwZXe7b9GPgH5RRqt+jMkXj22w/QvnFaquu5n6d7T3r8tcoJ3MX1sHNfkapyLbVwZQP66WBPwP/a/u1kjaj/HFr61SqvWN3DD6h6Fv/vdnwXNs79HwhfU39QnouZZT5NutydmrWUwauwtb7naj+dzl7PMWhlM+JxSlT130G2AV4PWWKz82aizaiNzFjAPOvUbrqnaoyCOm3KVNbt9X8tu+uy2+lzHLwn3rx4WKeOv5TW0wFzqN8QRqwHGW8AFPG2WodSd8ZbhPlXKPN9qbM5nFP78p6IfN8nnq+0RbHAacDtzNjnJ3FKMUuU2ZRbKtvA5u7zN63OvBN2xurzDR3GLB5o+mG1+UZCH8MvH9w4VnSRpQZSNbtR4g2j0TcWbYPpIxjcDplaskf1/XT2z4mA/Ac23va/qXtbSkf0GfWAXHarKu5F9CMaSQXtX0hgO2/U0YmbrMFbZ9q+xjK96ITKQu/p93Tep4saWDE588NrJS0BjMGamut+kV0pi+ktH+kcKCT2af2vFd6B8J7DnB/Y6nGpsvZY2hL2P5V/Zv7qO1jXfyKctWxC5450PXCZRDSVs9YBdwnaaDV7B3M+GxbgHafQ7+FchX6q7ZfbfvVwK11uZXFi2oX4ArgokG3qcB/G8w1FmLoz7MnaPd0pC+n/B5eaHsX27sAd9Tl94zy2KbNb3t6Xb6ROsWn7TOY0b2hjR63/R9Kq4WZZiBsMtQYLTZUqznb51EKX32RFhhziO0rgSubzjEbFu4ZuRrb+0u6mTIg6eLNRhtRV3MfDJxSr+acJulASrV7U+CSJoONwcOSNqc0I7OkN9j+ZW3S19rpq2x/XtLzJK0MnO86/oLtayX9qOF4I5kqaXHbD3TwC2kns9t+r6QNJLm2jFqb0r/5GkrrutbqcvYYVm8LscFX6No8s8SzVaZNFrCKpKd5xuB6bZ8y8APA0bUrye2Uv2XnAC8EvtxoshHY/rmk04Ev
SnoP8AnaXSwecCFlUMM/D94gad/+x5kl+1NmlvstpcskwGqULiRfbCzVKOrnw2uBPSSdRWnZ1YX3CpTfx8MoLdK2Bc6GJ8fdanOL2i7PQHiqpN9QWhQNvM9XBd4FnNavEBkDI2Yi6avAb23/btD6LYHv2l6zmWQj62puAEmbAB9kxvSSN1H6CB9uu7XTQKmMXP1VytWF/6X8DDtT+sS9b6gTkDaQtAdlQMmrKSOEf9T2yXXbxbYH9x1ujdrkeqgvpE+2amirLmaXtA+wFeX38gzKLEFnUU5IT7e9f4PxRtTl7DE0Se8HjvagQW9r67HdbX+skWCjqEXtXhfZfqAOAre97YOayDVWKjMabM6Mz+hpdGhgxtpn/ZvAOrZXaDrPSCQtCzzsFs8eMZLaXWQLnjqI593DP6o96oWdb1HG8Xt203lGozJTyvso3TEupZw3Py5pUeDptm9oNOAw9NQZCDegDCrf+hkIASRtBWzHzO/zKbZP6VuGlp43RgtJ2mWgO0yXJHf/tTm7yiC7L6sn0JMoHx4/sX2gegb4bJsufyHtavb6XnkxpTvXrcAqtu+rJ0fnu8UDYXY5e0RMrDr+0BK272s6S0TEeLW5/160zxeaDjCbkrv/2px9vp5uI9cDmwBbSfom7e6nuj2wMfBK4MOU8XW+SLna89Ymg41BV7M/ZvvxejXwnwMn/7Xv6hPNRhtVl7PHECS9sV6hRtIKko6SdLmk4ySt0nS+4Uh6Uc/ygpI+J2mKpC/Xpt6tJWlJSQdI+omknQZtO7ipXLOqtnKb2nSO0UhaXNJ+kq6UdK+k6ZLOk/TuprONRy0ot5Kk+SS9R9JvJF0q6WJJx9bWwa1Wfz//r/5+vm3Qttb+ftbX+HO1G+1cQ9Kh/XqujIERM5F02XCbgBX7mWVWJHf/dTj7bZJe7DrNa22J8XrKNLAvbDTZyB6z/TjwkKSZvpBKavsX0q5m/29Pf/31B1ZKWor2FwG6nD2Gtr/ttevy9yizTHwWeA1lZPjXNhVsFEcAA13zDqDMhvENyhSw36f0nW6rwTOcbU8HZjiTdD8zxjEYKMw/bWC97SWbSTaqo4FfUIrbO1DGCDgW+Jyk59r+bJPhRqKnTkf65CbgGf3MMosOA24A/o9yseE+4I+U1/yFtr/bZLhRdHUGwmUos+qcJelW4BjgONv/bjTVGAwU0YfaBGzdtxzpQhK9JN1G+eAY3F9PwJ9tP7P/qUaX3P3X1ez1SuVjtm8dYtvGtv/UQKxRqUzL/GrbD6lnwNr6hfSslo/d0cnskhauJ0KD1y8PrGS7zVfVOps9hibpGttr1eWLbPcWpi6x/eLGwo2gt2uepEuAl9p+tHZruLTN3ZkGv66S9qacpG8LnNHiv13foXxB+pTt2+q6f9levdFgo5B0qe11e+5faPulkuYDrrL9vAbjjUjSo5QCzFBfrLa3vUSfI42JpMt6fwclnWd7I0kLA5fYfn6D8UbU4d/PJ8dbk/Q/lJkr30QZm+0Y231ryTCrJD1OKXj1tlh2vb+y7b4MKJ0WGDHYr4HFB65O95J0dt/TjF1y918ns9ueNsK2VhYvqlcOfCEdKABUC1IGT22zTmYfqgBQ199BmVKxtbqcPYZ1tqT9KFdKz5b0Rtu/kPRq4N6Gs41kKUlvpHRbXtj2o1CaAUhq+1W0Ts5wZvsjktYHjpH0S0qLnba/1gAPSnqF7XMlbQvcBeVzoxa82uwy4Ou2rxi8QdJrGsgzVo9Keo7tf0p6CXW6WtuP5PdzzrP9R+CPKgPMv5bSrba1BQzgOmAz2zcO3iDppiH2nyPSAiMiIiKi5VRG3N8bGJiKeBXgQeBXwJ5DnVC2gaTBAzrvafs2Sc+gzKqyWRO5xkIdnuEMyvgGlFm33gI8p60tIweojJfyI2BN4ErgPbb/LmkFYCfb32k04AjqlfQbhvliN9l2K8cgkbQppZvXfylTj+5o+/z6mn/K9qebzDeSrv5+SjrW9o5N55gdkj4M
nGv70iG27dGvLkcpYERERER0SO1+tYDtO5vOEu0naSVgPfdxmsPojtq6ZbnaQi+i9dKFJCIiIqIDJL0SuM32NZI2lvQy4Grbv2k620gkbUDpNXKhpLWBLYG/de0LtaRXABsAV9j+bdN5RiLpecB2wMp11c11HIyrG4w129Ti6dmhzBIE/MH2XbX1wjeA9YCrgE+M1H21BdYCtpP05HsFmNL290odUHJ34N+UwUg/C7yMMpbEl20PHqOtNSRtQRnIuPc1P9n2aY2Fmk2SjrLd18GY0wIjIvpO0p9tv3wW9t8E+KTt18+xUBERLSbp25QvzwsApwObAacCrwL+avtTzaUbnqR9gK0ouc8ANgTOovT3Pt32/g3GG5GkC2xvUJffR5kG+hfA5sCvbB/QZL7hSPoMZWDAY4GBL86rADsCx7Y190gk3Wh7taZzDEfSVQOzBEk6jjJL0AmUWYLebruVswR1+b0i6RTgcmBJ4Pl1+XjK35Z1bW/XYLxh1b/lzwWOYubX/F3AP2x/tKFoo5I0ZfAq4NXAmQC2t+1LjhQwIqLtUsCIiHmdpCuBdYBFKVfrVq4z+yxIKWCs02jAYUi6HHgxsDBwK7CK7fskLQqc3/JZSHpnULkQ2Nr2dEmLAefZbuXU25L+DrxgYMDUnvULAVe2eGyAkaZnf67thfuZZ1Z0eJagTr5XYMbrWrvATLO98uBtzaUbnqS/237uEOsF/L3lr/nFlFZFP2LG7CPHUApe2P5DP3LM148niYjoJemB+u8mks6WdKKkv0k6emCkcUlb1nUXU6aXGnjsYpIOl3SBpL9K2q6uP1DS5+vyFpLOqQOYRUTMDexy1WlgJp+BK1BP0O7zucdsP277IeCftu8DsP0fZvwsbTWfpGUkLUe56DcdwPaDwGPNRhvRE8BQA3auRLtf8xUpV6G3GeLW9vFezpa0Xy3MnV27lNCBWYK6+l6B+vsJrAosLmkSQP197ct0nrPpYUkvHWL9S4GH+x1mFk0GLqIMKH2v7bOB/9j+Q7+KF5AxMCKieesBL6D0YfwTsLGkqcAPgU2Ba4HjevbfGzjT9nskLQ1cIOl3wF7AhZL+CHyHcqWs7R++ERFj9Zv6920RytWv4yWdR+lCck6jyUb2X0lPqwWM3qvSS9H+L0hLUU7WBVjSSrZvkbR4XddWHwN+L+kfwMDUhqsBa1DGDGirTk7PXu1OOT+5pt7/X0kDswS9s7FUo/sY3XyvQJlS+m91+T3Aj+rUr2sDX2gs1ejeDRwiaQlmdCFZlVLoendDmcaknld/S9IJ9d/baKCekC4kEdF3kh6wvXjtGrL3QN9QSYdQihhXAN+x/cq6fltgN9uvr8WNRZhx9WtZYAvbV0t6OeVE/n/7NZVTRES/1EE7bfs8Sc8B3gjcCJzY1oKtpIVtPzLE+uWBlWxf3kCscZH0NGBF2/9qOstwagvEDZh5kMALbT/eXKp5Q9dmCerye0XS/JTvs49JWoDSXe1m27c0m2x0KlNJP/ma2761yTyzQ9LrgI1tf7afz5sWGBHRtN4T28cZ/e+SgDfbvmaIbS+kNDNt9Vz3ERGzw/Zfepb/Kelw23c1mWk0wxQvlq1TNnZy2sY69sj0pnOMpBa0zms6x7xC0otsXwZgu81dRp5i8HtF0odsd+K9M6jIsgilVdd/GoozJnV8kUdrweLW2s1oE0lXdmEWEkmrAffZvge4ElhU0jq2r+hXhjb3mYyIedffgEn1CiOUEbIHnA7s0TNWxsAAa88CPkHpkrKVpA37mDciYo5SmTb1aklXStpQ0hmUbnM31ZYZrSTpcz3La9dBAy+SdH3H/05f1XSA4Uh6kaTz6nvj0DpOwMC2C5rMNpKu5q7+Kukfkr6oMlVwJ0j6+OAbsF/PcmtJOrhn+RWU38lvAJdL2rqxYKO7EFgaQNKngP0pgzN/QtL/NZhrVJL2BP4AnCfpvcBplFmmjuvn+yUtMCKidWw/LGk3Sp/vh4A/AkvUzV8E
vg1cVps9/kvSNpQ5wD9p+9+SdgWOkPRS220fECkiYiy+BewALA78BniD7XMlvQT4LrBxk+FG8CbgS3X5a8BHbZ8qaQPK3/IxT6ndbyOckIvy/9BWBwP7Uq6qvxc4V9K2tv8JLNhksFF0NTfAZZSxLnYCptTxL46hTEV6fZPBRvEF4BTKlfSBcV3mZ8Y5V5tt1LP8RcrfxIslPZsyneopzcQa1fy2767LbwX+x/Z/JB0AXEwZ062t3kkZY+RpwPXAs3tmZjof+GY/QqSAERF9Z3vx+u/ZwNk963fvWT4NeN4Qj/0P8P4hDvuann0uonQniYiYWyw4MF6EpOm2zwWoJ+yLNhttzJ5p+1QA2xd0IPeXKUWXoWYcaXMr5iV6mqJ/XdJFwGmS3smM2WvaqKu5oYxNcwVlIM+9a4FuR0oR5kbbbS3UvYDSamEx4Au1e9TOtts8COZQlrR9MYDt69TuWeju6+lycQel68t/KN/L25wb4PFabPkvJfOdUGZmqg2j+yIFjIiIiIj26z2xHXyFrs1TBj5b0hTK1d1VemYkgfZfVb8Y+GUtis+kNp9uLUlLDYzFYPssSW8Gfk4Z+Lq1upqbQbPS2L6AMkvaJ4BXNhNpdLZvBN6iMiX9GZK+1XSmWfA8SZdRXvtJkpaxfXctXrT5b+IHgKMlXQrcDkyVdA7lwtuXG002uosl/YxS8Po9cKSk0yizBvatW11mIYmIiIhouTob0+96vvwPrH8OZWDjrzaTbGSSXjVo1cW275e0IrC97YOayDUWktYC7rL9lAE7Ja1o+7YGYo1K0tuA6wYPxFgH3/t/tt/XTLKRdTU3lOy2f9Z0jvGo3QD2BTYcmAWuzerYZ71usf3fOsPRK22f1ESusaizp2wOPJfSoGAacHodGLO16kwvb6G0iDoR2JDSbepG4CDbD/YlRwoYERERERERIWm5rkwBO1iXs8fYtb2fTURERMQ8T9JSkg6Q9DdJd0m6s85KcoCkpZvON5yu5gaQ9AxJh0g6SNJykvaVdLmk4yWt1HS+4SR3/0nasmd5KUmHSbpM0s9qa6NWqr+Hy9flyZKuo8wwccMQradaZZjs57c9u6SLJX1OM2ba6wxJi0vaT2U2rHslTVeZOWjnfuZIASMiIiKi/Y4H7gY2sb2s7eWAV9d1xzeabGRdzQ1wBKVf903AWZRB67amzIz1/eZijeoIkrvfescu+AZwC7ANZcrMHzSSaGxeZ/uOuvw14K221wReS/k52myo7GvQ/uzLUKZRPUvSBZL+V9IzG840VkcD1wFbUGaw+Q5lZpJNJfVt/I50IYmIiIhoOUnX2F5rVrc1rau5AST91fZ6dflG26v1bLvE9osbCzeC5O4/SRfbfkldnilrm7NLuhp4oe3HJJ1ne6OebZfbbu2Mbl3NPui98j+UMSTeBFwNHGP70CbzjUTSpbbX7bl/oe2X1oFTr7L9lNkD54S0wIiIiIhovxskfbq3ObqkFSV9hnLFuq26mhtmPk8+aoRtbZPc/fd0SR9XmXVkSWmmOSXbnP1g4BRJm1KmrD1Q0qskfQG4pNloo+pydgBs/9H2h4CVga8AL2s40mgelPQKeHJg6bsAbD/BoJl45qRMoxoRERHRfm8F9gT+IOnpdd1twBRgh8ZSja6ruQFOlrS47Qdsf25gpaQ1gL83mGs0yd1/PwSWqMtHAssD0yU9gxZ/mbb9XUmXAx9kxowYawK/BL7UYLRRdTj7U97Lth8HTqu3Nvsg8ENJawJXAu8BkLQC0LcZpdKFJCIiIiIiIiJar81NmiIiIiICkPQRSas0nWNWdTU3dDd7cvdfV7N3NTd0N3vNvWrTOWZHW17ztMCIiIiIaDlJ9wIPAv8EjgFOsD292VSj62pu6G725O6/rmbvam7obvau5ob2ZE8LjIiIiIj2uw5YBfgisD5wlaTTJO0saYmRH9qoruaG7mZP7v7ravau5obuZu9qbmhJ9rTAiIiIiGi53qn36v0Fga0o
U/C9xvYKjYUbQVdzQ3ezJ3f/dTV7V3NDd7N3NTe0J3sKGBEREREtJ+mvttcbZtvTbD/U70xj0dXc0N3syd1/Xc3e1dzQ3exdzQ3tyZ4CRkRERETLSXqu7bZPJfkUXc0N3c2e3P3X1exdzQ3dzd7V3NCe7ClgRERERHSAJAEbACvXVTcDF7jlJ3NdzQ3dzZ7c/dfV7F3NDd3N3tXc0I7sKWBEREREtJykzYGDgX9QThihDKa2BvAh279tKttIupobups9ufuvq9m7mhu6m72ruaE92VPAiIiIiGg5SVcDW9m+ftD61YFTbD+/kWCj6Gpu6G725O6/rmbvam7obvau5ob2ZM80qhERERHttwAwbYj1NwML9jnLrOhqbuhu9uTuv65m72pu6G72ruaGlmRfoF9PFBERERGz7XDgQknHAjfVdasCOwKHNZZqdF3NDd3Nntz919XsXc0N3c3e1dzQkuzpQhIRERHRAZKeD2zHzIOnTbF9VXOpRtfV3NDd7Mndf13N3tXc0N3sXc0N7cieAkZEREREREREtF7GwIiIiIhoOUlb9iwvJelHki6T9DNJKzaZbSRdzQ3dzZ7c/dfV7F3NDd3N3tXc0J7sKWBEREREtN+Xe5a/AdwKbANcCPygkURj09Xc0N3syd1/Xc3e1dzQ3exdzQ0tyZ4uJBEREREtJ+li2y+py5fYfnHPtpnut0lXc0N3syd3/3U1e1dzQ3ezdzU3tCd7ZiGJiIiIaL+nS/o4IGBJSfKMq1BtblHb1dzQ3ezJ3X9dzd7V3NDd7F3NDS3J3vYXKSIiIiLgh8ASwOLAkcDyAJKeAVzSXKxRdTU3dDd7cvdfV7N3NTd0N3tXc0NLsqcLSURERESHSdrF9o+bzjGrupobups9ufuvq9m7mhu6m72ruaG/2VPAiIiIiOgwSTfaXq3pHLOqq7mhu9mTu/+6mr2ruaG72buaG/qbPWNgRERERLScpMuG2wS0duq9ruaG7mZP7v7ravau5obuZu9qbmhP9hQwIiIiItpvRWAL4O5B6wX8uf9xxqyruaG72ZO7/7qavau5obvZu5obWpI9BYyIiIiI9vs1sLjtSwZvkHR239OMXVdzQ3ezJ3f/dTV7V3NDd7N3NTe0JHvGwIiIiIiIiIiI1ss0qhERERERERHReilgRERERERERETrpYARERERERGdJmmWBhGUtImkX8+pPBExZ6SAERERERERnWb75U1niIg5LwWMiIiIiIjoNEkP1H83kXS2pBMl/U3S0ZJUt21Z110MvKnnsYtJOlzSBZL+Kmm7uv5ASZ+vy1tIOkdSvj9FNCjTqEZERERExNxkPeAFwL+BPwEbS5oK/BDYFLgWOK5n/72BM22/R9LSwAWSfgfsBVwo6Y/Ad4CtbT/Rvx8jIgZLBTEiIiIiIuYmF9ieVosNlwCTgOcB/7L9D9sGftqz/+bAnpIuAc4GFgFWs/0Q8D7gDOB7tv/Zt58gIoaUFhgRERERETE3eaRn+XFG/84j4M22rxli2wuBO4FnTlC2iBiHtMCIiIiIiIi53d+ASZKeU+/v1LPtdGCPnrEy1qv/Pgv4BKVLylaSNuxj3ogYQgoYERERERExV7P9MLAb8Js6iOftPZu/CCwIXCbpSuCLtZhxGPBJ2/8GdgV+JGmRPkePiB4qXcAiIiIiIiIiItorLTAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiImKWSXq3pHN77j8g6dmjPGaSJEtaYJzPfb2k14znGPU4o2aOiPZIASNiHtHkScZEyUlGREREe9le3PZ1TeeYFROVWdJrJZ0l6X5Jd0q6RNJnJC0yETkjokgBI2IeNS+f
ZDRJ0iaSpjWdIyIiIiaGpLcAJwI/A55lezngrcAqwKrDPKYVF4ciuiYFjIiYK0iav+kMERERcytJq0o6SdL02sLge0PsY0lr1OVFJX1D0g2S7pV0rqRFh3jMm2t3kHVGef531mPdKWnvQdvmk7SnpH/W7cdLWrZuO1XS7oP2v1TSm2Yls6SNJP1Z0j318ZvU9QK+Cexn+4e27wKwfY3tPWz/o+63r6QTJf1U0n3AuyU9U9IUSXdJulbS+3oyHiHpSz33Z7oAUl+zvSRdJeluST9Oa4+YF6SAETEXasFJxpAf8nXb2ZK+KOlPtZnlbyUtX7fNyknGEZIOkXSKpAeBV0t6fj3+PZKulLRtz3GOkHSQpN/U5z1f0nMGvR4fkvSPuv2Lkp5Tf4776snQQj37v16leeg9dZ8X9Wy7XtInJV1WX8/jJC0iaTHgVOCZKt1hHpD0zJFey4iIiKapXCT4NXADMAlYGTh2lId9HVgfeDmwLPBp4IlBx90F+ArwGttXjPD8awOHAO8EngksR2ndMGAP4A3Aq+r2u4GD6rZjgJ0GHetZwG/GmlnSynX/L9X1nwR+LmkFYK2a5ecjvBYDtqO01FgaOJryGk6rmbcHvixp0zEcZ8DbgS2A5wDPBT43C4+N6KQUMAaRdLik2yUN+0d00P471MrnlZJ+NqfzRYymBScZI33ID3gbsAvwdGChug/M2knGwHH2B5YAzgd+Bfy2HncP4GhJa/XsvyPwBWAZ4Nr62F5b1Ndho/oaHAq8g9L8c52BbJLWAw4H3k85ifoBMEXSwj3H2gHYElgdeBHwbtsPAlsB/67dYRa3/e9hfraIiIi22IDyJftTth+0/bDtc4fbWdJ8wHuAj9q+2fbjtv9s+5Ge3T4GfArYxPa1ozz/9sCvbZ9Tj/H/mPk85QPA3ran1e37AturdNP4BfBiSc+q+74dOGlQltEyvwM4xfYptp+wfQYwFdgaWL4e4taeYx1bL3A8JOmdPU/zF9u/tP1EfdzGwGfq63kJ8CPgXaO8Fr2+Z/um2upjf3rOoSLmVilgPNURlC8do5K0JrAXsLHtF1D+EEc0remTjJE+5Af82Pbfbf8HOB54cV0/ppOMHifb/lM9EXgxsDhwgO3/2j6TUsjp/TD/he0LbD9GufLx4kHH+6rt+2xfCVwB/Nb2dbbvpbScWK/utxvwA9vn19frSOARSuFjwHds/7ueVPxqiOeKiIjoilWBG+rn51gsDywC/HOEfT4FHGR7LONCPRO4aeBOvSBwZ8/2ZwG/qEWDe4CrgceBFW3fT7kQsmPddyfKOcCsZH4W8JaB49fneAWwUk+OlXry7Wh7aeBioLeL6009y88E7qr5BtxAufA0Vr3Hu6EeM2KulgLGILbPAe7qXVebkZ8m6SJJf5T0vLrpfZQ/vHfXx97e57gRQ2n6JGOkD/kBt/YsP0QpPDALJxkDBp8I3FSLGQMGnwgM+bw9butZ/s8Q9wf2fxbwiUE/46rMfOIw2nNFRER0xU3Aahr7wJN3AA9TujYMZ3Pgc5LePIbj3ULPYJiSnkZpAdmbbyvbS/fcFrF9c91+DLCTpJdRznnOmsXMNwE/GXT8xWwfAFwD3Ay8aQw/h3uW/w0sK2mJnnWr1WMBPAg8rWfbM4Y4Xu8AoavVY0bM1VLAGJtDgT1sr09p6n5wXf9c4Lm1L/95ksbUciNiDmv6JGOkD/mxGMtJxoDBJwKr1hYlA3pPBCbSTcD+g37Gp9k+ZgyP9ei7REREtMoFlCLCAZIWq+M6bTzczvViwuHAN1UGqpxf0ssGdbW8ktLq+aDeMauGcSLwekmvqONR7cfM32O+D+w/0IJT0gqStuvZfgrl4sN+wHGDLnaMJfNPgW0kbVHXL6IyqOYq9XGfAPaR9D5Jy6hYE1hxhNfoJuDPwP/V470I2LU+F8AlwNaSlpX0DIZu6f1hSauoDFi6N3DcSC9ixNwgBYxRSFqcMi7ACZIuofR1H7iSvACwJrAJ
5UrxDyUt3f+UETNp+iRj2A/5MeYf9SRjGOdTWjp8WtKCKgOHbsPo43/Mjh8CH5C0YT1JWUzS6wZdRRnObcBykpaaA7kiIiImnO3HKZ+pawA3UgaefOsoD/skcDlwIaV181cY9N3D9qXA6ynn0FuN8PxXAh+mTFN6C2WQzt5WoQcCU4DfSrofOA/YsOfxjwAnAa+px5ilzLXYsB3wWWA65ULGpwZ+HtvHUca+ekfddgeli+yhwAkjPN9OlPHK/k3pRruP7d/VbT8BLgWup4zvNVRx4md123WUlrRfGmKfiLlK5h8e3XzAPbZfPMS2acD5th8F/iXp75SCxoV9zBcxE9uPS9oG+A7lJMOUD7iLR3jYJ4H/o7x3F6d8YG4x6LiXSno98BtJj9o+dZjnv6le9fgqpTXF45SiygfHmP8RSSdRxuX47FgeUx/33/pzH0wZm+Zm4F22/zbWY8zCc01Vmerse5Tf+f8A5wLnjOGxf5N0DHBdHXB17QzkGRERbWf7RspMH4Md0bOPepb/Q2k18LFB+18P9O43lRFaKvTsdyRwZM+q/Xu2PUGZyvSbIzx+V0oLh8Hrx5IZ2+dTZjkZ7vinAaeNsH3fIdZNoxRwhtr/YZ5aJPrWoPsX2v6/4Z4zYm4kO62ZB5M0iTLS8Tr1/p+Bb9k+QZKAF9Uvc1sCO9neWWUayL8CL7Z957AHj4iIiIiIGAdJ1wPv7WmxETFPSBeSQeqV0b8Aa0maJmlXykwIu0q6lNKUfqBP3enAnZKuovTT/1SKFxERERERs0bS2yU9MMTtyqazRUR7pAVGRMwySW+njAcz2A11SuGIiIiIiIgJlQJGRERERERERLReBvHssfzyy3vSpElNx4iIiOiUiy666A7bK/TjuSStxcyj8T8b+DxwVF0/iTJI4A62765jVx0IbE2Zqejdti+ux9oZ+Fw9zpfqIIFIWp8yMOGilJmRPmrbdarCpzzHcFlzXhERETF7hju3SAuMHpMnT/bUqVObjhEREdEpki6yPbmB552fMuPQhpQpFu+yfYCkPYFlbH9G0tbAHpQCxobAgbY3rMWIqcBkymxNFwHr16LHBcBHKNMznwJ8x/apkr461HMMly/nFREREbNnuHOLDOIZERERXbUZ8E/bN1AG2B6YYvFIZkz3uB1wlIvzgKUlrUSZKvoM23fVVhRnAFvWbUvaPs/lKs9Rg4411HNEREREH6SAEREREV21I3BMXV7R9i11+VZgxbq8MnBTz2Om1XUjrZ82xPqRnuNJknaTNFXS1OnTp8/WDxURERFDSwEjIiIiOkfSQsC2wAmDt9WWE3O0j+xwz2H7UNuTbU9eYYW+DAsSERExz0gBIyIiIrpoK+Bi27fV+7fV7h/Uf2+v628GVu153Cp13UjrVxli/UjPEREREX2QAkZERER00U7M6D4CMAXYuS7vDJzcs/5dKjYC7q3dQE4HNpe0jKRlgM2B0+u2+yRtVGcwedegYw31HBEREdEHmUY1IiIiOkXSYsBrgff3rD4AOF7SrsANwA51/SmUGUiupUyjuguA7bskfRG4sO63n+276vKHmDGN6qn1NtJzRERERB+kgBEREUPa+LsbNx0h+uhPe/yp6QhjZvtBYLlB6+6kzEoyeF9Tplgd6jiHA4cPsX4qsM4Q64d8jvFa/1NHTfQh50oXfe1dTUeIiIiGpQtJRERERERERLReChgRERERERER0XopYERERERERERE66WAERERERERERGtlwJGRERERERERLReChgRERERERER0XopYERERERERERE66WAERERERERERGtlwJGRERERERERLReJwsYkg6XdLukK4bZLknfkXStpMskvaTfGSMiIiIiIiJi4nSygAEcAWw5wvatgDXrbTfgkD5kioiIiIiIiIg5pJMFDNvnAHeNsMt2wFEuzgOWlrRSf9JFRERERERExETrZAFjDFYGbuq5P62uewpJu0maKmnq9OnT+xIuIiIiIiIi
ImbN3FrAGDPbh9qebHvyCius0HSciIiIiIiIiBjC3FrAuBlYtef+KnVdRERERERERHTQ3FrAmAK8q85GshFwr+1bmg4VERER4ydpaUknSvqbpKslvUzSspLOkPSP+u8ydd9hZyaTtHPd/x+Sdu5Zv76ky+tjviNJdf2QzxERERH90ckChqRjgL8Aa0maJmlXSR+Q9IG6yynAdcC1wA+BDzUUNSIiIibegcBptp8HrAtcDewJ/N72msDv630YZmYyScsC+wAbAhsA+/QUJA4B3tfzuIGZz4Z7joiIiOiDBZoOMDts7zTKdgMf7lOciIiI6BNJSwGvBN4NYPu/wH8lbQdsUnc7Ejgb+Aw9M5MB59XWGyvVfc+wfVc97hnAlpLOBpass5gh6SjgDcCp9VhDPUdERET0QSdbYERERMQ8a3VgOvBjSX+V9CNJiwEr9nQXvRVYsS4PNzPZSOunDbGeEZ7jSZndLCIiYs5JASMiIiK6ZAHgJcAhttcDHmRQV47a2sJzMsRwz5HZzSIiIuacFDAiIiKiS6YB02yfX++fSClo3Fa7hlD/vb1uH25mspHWrzLEekZ4joiIiOiDFDAiIiKiM2zfCtwkaa26ajPgKsoMZAMziewMnFyXh5uZ7HRgc0nL1ME7NwdOr9vuk7RRnX3kXYOONdRzRERERB90chDPiIiImKftARwtaSHKrGO7UC7KHC9pV+AGYIe67ynA1pSZyR6q+2L7LklfBC6s++03MKAnZfayI4BFKYN3nlrXHzDMc0REREQfpIARERERnWL7EmDyEJs2G2LfYWcms304cPgQ66cC6wyx/s6hniMiIiL6I11IIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi8FjIiIiIiIiIhovRQwIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi8FjIiIiIiIiIhovRQwIiIiIiIiIqL1Gi1gSPrJWNZFRETE3EnSkpKWaDpHREREtF/TLTBe0HtH0vzA+g1liYiIiD6R9FJJlwOXAVdIulRSzgEiIiJiWI0UMCTtJel+4EWS7qu3+4HbgZObyBQRERF9dRjwIduTbD8L+DDw44YzRURERIs1UsCw/X+2lwC+ZnvJelvC9nK292oiU0RERPTV47b/OHDH9rnAYw3miYiIiJZboMknt72XpJWBZ/VmsX1Oc6kiIiKiD/4g6QfAMYCBtwJnS3oJgO2LmwwXERER7dNoAUPSAcCOwFXA43W1gVELGJK2BA4E5gd+ZPuAQdtXA44Elq777Gn7lAkLHxEREeOxbv13n0Hr16OcC2w63AMlXQ/cTzl3eMz2ZEnLAscBk4DrgR1s3y1JlPOFrYGHgHcPFEck7Qx8rh72S7aPrOvXB44AFgVOAT5q28M9x2z99BERETHLGi1gAG8E1rL9yKw8qA72eRDwWmAacKGkKbav6tntc8Dxtg+RtDblBGTSxMSOiIiI8bD96nEe4tW27+i5vyfwe9sHSNqz3v8MsBWwZr1tCBwCbFiLEfsAkykFk4vqucTddZ/3AedTzh+2BE4d4TkiIiKiD5ouYFwHLAjMUgED2AC41vZ1AJKOBbajtOQYYGDJurwU8O/xRY2IiIiJIunzQ623vd9sHnI7YJO6fCRwNqW4sB1wlG0D50laWtJKdd8zbN9V85wBbCnpbGBJ2+fV9UcBb6AUMIZ7joiIiOiDpgsYDwGXSPo9PUUM2x8Z5XErAzf13J9GuarSa1/gt5L2ABYDXjPutBERETFRHuxZXgR4PXD1GB9ryme8gR/YPhRY0fYtdfutwIp1eahzhpVHWT9tiPWM8BwRERHRB00XMKbU25ywE3CE7W9IehnwE0nr2H6idydJuwG7Aay22mpzKEpERET0sv2N3vuSvg6cPsaHv8L2zZKeDpwh6W+Dju1a3JhjhnuOnFdERETMOU3PQnKkpEWB1WxfMwsPvRlYtef+KnVdr10pfVax/RdJiwDLA7cP
ynAocCjA5MmT5+jJTkRERAzraZTP81HZvrn+e7ukX1C6lt4maSXbt9QuIgOf98OdM9zMjO4gA+vPrutXGWJ/RniO3mw5r4iIiJhD5mvyySVtA1wCnFbvv1jSWFpkXAisKWl1SQtRZjIZ/Lgbgc3qcZ9PaZ46fYKiR0RExDhIulzSZfV2JXAN8O0xPG4xSUsMLAObA1dQzgN2rrvtDJxcl6cA71KxEXBv7QZyOrC5pGUkLVOPc3rddp+kjeoMJu8adKyhniMiIiL6oOkuJPtSrpqcDWD7EknPHu1Bth+TtDvl5GN+4HDbV0raD5hqewrwCeCHkv6X0lf23XUAr4iIiGje63uWHwNus/3YGB63IvCLUltgAeBntk+TdCFwvKRdgRuAHer+p1CmUL2WMvbWLgC275L0RcpFEYD9Bgb0BD7EjGlUT603gAOGeY6IiIjog6YLGI/avreehAx4Yride9k+hXJS0rvu8z3LVwEbT0TIiIiImFi2b5C0LvA/ddU5wGVjeNx1wLpDrL+T2vJy0HoDHx7mWIcDhw+xfiqwzlifIyIiIvqj0S4kwJWS3gbML2lNSd8F/txwpoiIiJjDJH0UOBp4er0dXWcOi4iIiBhS0wWMPYAXUKZQPQa4D/hYk4EiIiKiL3YFNrT9+dqCciPgfQ1nioiIiBZrehaSh4C96y0iIiLmHQIe77n/eF0XERERMaRGCxiSJgOfBSb1ZrH9oqYyRURERF/8GDi/ToMK8AbgsObiRERERNs1PYjn0cCngMsZ4+CdERER0W2S5gPOo8xC9oq6ehfbf20sVERERLRe0wWM6XXK04iIiJhH2H5C0kG21wMubjpPREREdEPTBYx9JP0I+D1lIE8AbJ/UXKSIiIjog99LejNwUp3qNCIiImJETRcwdgGeByzIjC4kBlLAiIiImLu9H/g48JikhykDeNr2ks3GioiIiLZquoDxUttrNZwhIiIi+sz2Ek1niIiIiG5puoDxZ0lr276q4RwRERHRB5LmBxa1/UC9vxGwUN38V9v3NxYuIiIiWq3pAsZGwCWS/kUZA2Og+WimUY2IiJg7fQW4HfhqvX8McAWwCGVAz880lCsiIiJarukCxpYNP39ERET012bAS3vu32N7G0kC/thQpoiIiOiA+Zp8cts3ANOARymDdw7cIiIiYu40n+3Heu5/BkrzS2DxZiJFREREFzTaAkPSHsA+wG3MPAtJupBERETMnRaStMTAWBe2fwsgaSlKN5KIiIiIITXdheSjwFq272w4R0RERPTHD4HjJH3A9o0Akp4FHAL8qNFkERER0WqNdiEBbgLubThDRERE9IntbwJTgHMl3SnpLuAc4Fe2vz6WY0iaX9JfJf263l9d0vmSrpV0nKSF6vqF6/1r6/ZJPcfYq66/RtIWPeu3rOuulbRnz/ohnyMiIiL6p+kCxnXA2fUk4uMDt4YzRURExBxk+/u2VwMmAc+y/Szbh8zCIT4KXN1z/yvAt2yvAdwN7FrX7wrcXdd/q+6HpLWBHYEXUAYUP7gWReYHDgK2AtYGdqr7jvQcERER0SdNFzBuBM6gzP++RM8tIiIi5mKSVgS+DRxf768tadSigKRVgNdRu5vU2Us2BU6suxwJvKEub1fvU7dvVvffDjjW9iO2/wVcC2xQb9favs72f4Fjge1GeY6IiIjok0bHwLD9BQBJi9f7DzSZJyIiIvrmCODHwN71/t+B44DDRnnct4FPM+OCx3KUqVgHZjaZBqxcl1emdFfF9mOS7q37rwyc13PM3sfcNGj9hqM8x0wk7QbsBrDaaquN8qNERETErGi0BYakdST9FbgSuFLSRZJe0GSmiIiI6IvlbR9PnYWsFgceH+kBkl4P3G77oj7kmy22D7U92fbkFVZYoek4ERERc5WmZyE5FPi47bMAJG1CGZ385Q1mioiIiDnvQUnLUaZPR9JGjD6w98bAtpK2pky5uiRwILC0pAVqEWQV4Oa6/83AqsA0SQsA
SwF39qwf0PuYodbfOcJzRMQoNv7uxk1H6Iw/7fGnpiNEtFrTY2AsNlC8ALB9NrDYWB443Cjhg/bZQdJVkq6U9LOJiRwRERET4OOU2UieI+lPwFHAHiM9wPZetlexPYkyCOeZtt8OnAVsX3fbGTi5Lk+p96nbz7Ttun7HOkvJ6sCawAXAhcCadcaRhepzTKmPGe45IiIiok+aboFxnaT/B/yk3n8HZWaSEfWMEv5aSj/UCyVNsX1Vzz5rAnsBG9u+W9LTJzx9REREzBbbF0t6FbAWIOAa24/O5uE+Axwr6UvAX5kxjsZhwE8kXQvcRSlIYPtKSccDVwGPAR+2/TiApN2B04H5gcNtXznKc0RERESfNF3AeA/wBeAkShPSP9Z1o3lylHAAScdSRhS/qmef9wEH2b4bwPbtE5g7IiIixkHSh4GjBwoEkpaRtJPtg8fy+Npq8+y6fB3l3GDwPg8Dbxnm8fsD+w+x/hTglP/P3p/HWVbV9/7/680oTgzSIcggRFsNTgh9ATVXCSg0RsVZ0AgaAvdeIeLXIUKSnxgQo0mUK0a5oqJgVASi11ZRRBRnhkYQBERaHGhEQUYBAcHP74+9ynsoq6qrmqpzdlW/no/HeZy9P3vtvT6naLpWf87aa08Qn7APSZI0PCMrYLRZFJ+uqr9cjdP/sKp4M7ZK+KBHt36+Tfctylur6ksT5OFq4ZIkDd+BVfW+sZ02W/JAYFoFDEmStOYZ2RoYbarm75NsOEddrEN3T+uuwL7AB5NsNEEerhYuSdLwrZ0kYzvti431RpiPJEnquVHfQnIbcEmSM4Hbx4JV9dpVnDfV6uFjVgLntvtpf5LkR3QFjfPvd9aSJOn++hLwqSQfaPv/o8UkSZImNOoCxqfba6b+sEo4XeFiH+Dl49r8X7qZFx9JsindLSWrXCBUkiQNxZvpihb/q+2fCXxodOlIkqS+G2kBo6pOXM3z7plolfAkRwLLq2pZO7ZHksuAe4E3VdUNs5W7JElafVX1e+C49pIkSVqlkRQwkpxSVS9Ncgnd00fuo6qeuKprTLRKeFW9ZWC76J4x//r7n7EkSZoNszEGkCRJa6ZRzcA4tL0/Z0T9S5Kk0XAMIEmSVstIChhVdW3bfBFwclX9YhR5SJKk4XIMIEmSVtfIHqPaPAQ4M8k3kxySZLMR5yNJkobDMYAkSZqRkRYwquqfq+pxwMHA5sDXk3xllDlJkqS55xhAkiTN1KhnYIy5DvglcAPwJyPORZIkDY9jAEmSNC0jLWAkeU2Ss4GzgIcBB7r6uCRJC59jAEmSNFOjegrJmK2A11XVRSPOQ5IkDZdjAEmSNCOjXgPjcOCSJA9PsvXYa5Q5SZKkudfGAA9O8mqAJIuSbDvitCRJUo+NdAZGkkOAtwK/An7fwgU4hVSSpAUsyRHAEuAxwEeAdYH/BJ42yrwkSVJ/jXoRz9cBj6mqx1XVE9rL4oUkSQvfC4DnAbcDVNUv6B6tOqUkD0hyXpLvJ7k0yT+3+LZJzk2yIsmnkqzX4uu3/RXt+DYD1zq8xa9IsudAfGmLrUhy2EB8wj4kSdJwjLqAcTVwy4hzkCRJw3d3VRXdzEuSPGia590F7FZVTwK2B5Ym2QV4J3BMVT0KuAk4oLU/ALipxY9p7UiyHbAP8DhgKfD+JGsnWRt4H7AXsB2wb2vLFH1IkqQhGHUB4yrg7PYNyOvHXiPOSZIkzb1TknwA2CjJgcBXgA+u6qTq3NZ2122vAnYDTmvxE4Hnt+292z7t+O5J0uInV9VdVfUTYAWwU3utqKqrqupu4GRg73bOZH1IkqQhGPVTSH7eXuu1lyRJWgNU1b8neRZwK906GG+pqjOnc26bJXEB8Ci62RI/Bm6uqntak5XAFm17C7oZn1TVPUluoXts6xbAOQOXHTzn6nHxnds5k/UxmNtBwEEAW2/tuuSSJM2mkRYwqmrsvtUHVtUdo8xFkiQNVytYTKtoMe68e4Htk2wEfAZ47Cynttqq
6njgeIAlS5bUiNORJGlBGektJEmekuQy4Idt/0lJ3j/KnCRJ0txJ8pskt072msm1qupm4GvAU+huRRn7YmZL4Jq2fQ2wVet7HWBD4IbB+LhzJovfMEUfkiRpCEa9Bsb/BvakGxRQVd8Hnj7KhCRJ0typqodU1UOB9wCH0d2GsSXwZrpxwZSSLGozL0iyAfAs4HK6QsaLW7P9gc+27WVtn3b8q23x0GXAPu0pJdsCi4HzgPOBxe2JI+vRLfS5rJ0zWR+SJGkIRr0GBlV1dbcu1h/cO6pcJEnS0DyvPUlkzHFJvg+8ZRXnbQ6c2NbBWAs4pao+32Z0npzkbcCFwIdb+w8DH0uyAriRriBBVV2a5BTgMuAe4OB2awpJDgHOANYGTqiqS9u13jxJH5IkaQhGXcC4OslTgUqyLnAo3bcokiRpYbs9ySvonvJRwL7A7as6qaouBp48QfwquieIjI/fCbxkkmsdDRw9Qfx04PTp9iFJkoZj1LeQ/E/gYLrpo9fQPc/94FEmJEmShuLlwEuBX7XXS1pMkiRpQqN+CsmvgVdMdjzJ4VX1L0NMSZIkDUFV/RTYe7LjjgEkSdJ4o56BsSoTTvmUJEkLnmMASZJ0H30vYGTSA8nSJFckWZHksCnavShJJVkyNylKkqQ5MOkYQJIkrZn6XsCoiYJt5fH3AXsB2wH7JtlugnYPoVsY9Ny5TFKSJM26CccAkiRpzdX3AsZk377sBKyoqquq6m66Fcwnuo/2KOCdwJ1zlJ8kSZobzsCQJEn3MdICRpJNJohtO7B76iSnbgFcPbC/ssUGr7MDsFVVfWEVORyUZHmS5ddff/30EpckSffL/RgDSJKkNdSoZ2B8LslDx3babSCfG9uvqrevzkWTrAW8G3jDqtpW1fFVtaSqlixatGh1upMkSTM3J2MASZK0cI26gPF2ugHMg5PsSPdty19P47xrgK0G9rdssTEPAR4PnJ3kp8AuwDIX8pQkqTdWdwwgSZLWUOuMsvOq+kKSdYEv0xUdXlBVP5rGqecDi9tU02uAfYCXD1z3FmDTsf0kZwNvrKrls5i+JElaTfdjDCBJktZQIylgJHkv911dfEPgx8AhSaiq1051flXdk+QQ4AxgbeCEqro0yZHA8qpaNle5S5Kk1Xd/xwCSJGnNNaoZGONnQlww0wtU1enA6eNib5mk7a4zvb4kSZoT93sMIEmS1kwjKWBU1YkASR4E3FlV97b9tYH1R5GTJEmae44BJEnS6hr1Ip5nARsM7G8AfGVEuUiSpOFxDCBJkmZk1AWMB1TVbWM7bfuBI8xHkiQNh2MASZI0I6MuYNyeZIexnfYYtd+OMB9JkjQcqzUGSLJVkq8luSzJpUkObfFNkpyZ5Mr2vnGLJ8mxSVYkuXhcn/u39lcm2X8wlySXtHOOTZKp+pAkScMx6gLG64BTk3wzybeATwGHjDYlSZI0BK9j9cYA9wBvqKrtgF2Ag5NsBxwGnFVVi+luTzmstd8LWNxeBwHHQVeMAI4AdgZ2Ao4YKEgcBxw4cN7SFp+sD0mSNASjegoJAFV1fpLHAo9poSuq6nejzEmSJM291R0DVNW1wLVt+zdJLge2APYGdm3NTgTOBt7c4idVVQHnJNkoyeat7ZlVdSNAkjOBpUnOBh5aVee0+EnA84EvTtGHJEkagpEUMJLsVlVfTfLCcYce3Z4B/+lR5CVJkubWbI4BkmwDPBk4F9isFTcAfgls1ra3AK4eOG1li00VXzlBnCn6GMzpILqZHmy99dbT/SiSJGkaRjUD4xnAV4HnTnCsAAsYkiQtTLMyBkjyYOC/gNdV1a1tmYruIlWVpGYh10lN1kdVHQ8cD7BkyZI5zUGSpDXNSAoYVXVEe3/1KPqXJEmjMRtjgCTr0hUvPj4wY+NXSTavqmvbLSLXtfg1wFYDp2/ZYtfw/24HGYuf3eJbTtB+qj4kSdIQjOoWktdPdbyq3j2sXCRJ0vDc3zFAeyLIh4HLx7VdBuwP
vKO9f3YgfkiSk+kW7LylFSDOAN4+sHDnHsDhVXVjkluT7EJ3a8p+wHtX0YckSRqCUd1C8pApjjndUpKkhev+jgGeBrwSuCTJRS32D3RFhVOSHAD8DHhpO3Y68GxgBXAH8GqAVqg4Cji/tTtybEFP4DXAR4EN6Bbv/GKLT9aHJEkaglHdQvLPAElOBA6tqpvb/sbAu0aRkyRJmnv3dwxQVd8CMsnh3SdoX8DBk1zrBOCECeLLgcdPEL9hoj4kSdJwrDXi/p84NnABqKqb6FYTlyRJC5tjAEmSNCOjLmCsNXDvKUk2YXS3tUiSpOFxDCBJkmZk1AOFdwHfTXJq238JcPQI85EkScPhGECSJM3ISAsYVXVSkuXAbi30wqq6bJQ5SZKkuecYQJIkzdSoZ2DQBisOWCRJWsM4BpAkSTMx6jUwJEmSJEmSVskChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6btwWMJEuTXJFkRZLDJjj++iSXJbk4yVlJHjGKPCVJkiRJ0v03LwsYSdYG3gfsBWwH7Jtku3HNLgSWVNUTgdOAfx1ulpIkSZIkabbMywIGsBOwoqquqqq7gZOBvQcbVNXXquqOtnsOsOWQc5QkSZIkSbNkvhYwtgCuHthf2WKTOQD44kQHkhyUZHmS5ddff/0spihJkiRJkmbLfC1gTFuSvwaWAP820fGqOr6qllTVkkWLFg03OUmSJEmSNC3rjDqB1XQNsNXA/pYtdh9Jngn8I/CMqrprSLlJkiRJkqRZNl9nYJwPLE6ybZL1gH2AZYMNkjwZ+ADwvKq6bgQ5SpKkWZbkhCTXJfnBQGyTJGcmubK9b9ziSXJse2LZxUl2GDhn/9b+yiT7D8R3THJJO+fYJJmqD0mSNDzzsoBRVfcAhwBnAJcDp1TVpUmOTPK81uzfgAcDpya5KMmySS4nSZLmj48CS8fFDgPOqqrFwFltH7qnlS1ur4OA46ArRgBHADvTLQx+xEBB4jjgwIHzlq6iD0mSNCTz9RYSqup04PRxsbcMbD9z6ElJkqQ5VVXfSLLNuPDewK5t+0TgbODNLX5SVRVwTpKNkmze2p5ZVTcCJDkTWJrkbOChVXVOi58EPJ9uIfDJ+pAkSUMyL2dgSJIkDdisqq5t278ENmvbkz21bKr4ygniU/VxHz7dTJKkuWMBQ5IkLRhttkWNqg+fbiZJ0tyxgCFJkua7X7VbQ2jvY4t3T/bUsqniW04Qn6oPSZI0JBYwJEnSfLcMGHuSyP7AZwfi+7WnkewC3NJuAzkD2CPJxm3xzj2AM9qxW5Ps0p4+st+4a03UhyRJGpJ5u4inJEla8yT5JN1impsmWUn3NJF3AKckOQD4GfDS1vx04NnACuAO4NUAVXVjkqPoHssOcOTYgp7Aa+iedLIB3eKdX2zxyfqQJElDYgFDkiTNG1W17ySHdp+gbQEHT3KdE4ATJogvBx4/QfyGifqQJEnD4y0kkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeq9dUadgCRJkiRp9nz96c8YdQrzwjO+8fVRp6AZcgaGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6r15+xSSJEuB9wBrAx+qqneMO74+cBKwI3AD8LKq+umw85Rm28+PfMKoU9AQbf2WS0adgqQBqxp/SJKkuTMvZ2AkWRt4H7AXsB2wb5LtxjU7ALipqh4FHAO8c7hZSpKkhWSa4w9JkjRH5mUBA9gJWFFVV1XV3cDJwN7j2uwNnNi2TwN2T5Ih5ihJkhaW6Yw/JEnSHJmvt5BsAVw9sL8S2HmyNlV1T5JbgIcBv56LhHZ800lzcVn11AX/tt+oU5AkDd90xh+SJGmOzNcCxqxJchBwUNu9LckVo8xnHtqUOSoK9Vn+ff9Rp7AmWiP/rHGEE8dGYI38
s5bX3q8/a4+YrTzmuwU0rujd/wdrwO/e3v3M1wC9+5nfz7+L+653P28W/gT9/v3Mp2/CscV8LWBcA2w1sL9li03UZmWSdYAN6RbzvI+qOh44fo7yXPCSLK+qJaPOQwuff9Y0LP5Z0xRWOf5YKOMK/z8YPn/mw+fPfLj8eQ/fQvyZz9c1MM4HFifZNsl6wD7AsnFtlgFjpfoXA1+tqhpijpIkaWGZzvhDkiTNkXk5A6OtaXEIcAbdY8xOqKpLkxwJLK+qZcCHgY8lWQHcSDfIkCRJWi2TjT9GnJYkSWuMeVnAAKiq04HTx8XeMrB9J/CSYee1Bpr302Q1b/hnTcPinzVNaqLxxwLl/wfD5898+PyZD5c/7+FbcD/zeFeFJEmSJEnqu/m6BoYkSZIkSVqDWMCQJEmSJEm9ZwFDkiRJkiT1ngUMSb2U5LFJdk/y4HHxpaPKSQtfkpNGnYOkhc/fccOXZKck/61tb5fk9UmePeq81iT+jh2uJH/R/pzvMepcZpOLeGpWJHl1VX1k1HloYUjyWuBg4HJge+DQqvpsO/a9qtphhOlpgUiybHwI+EvgqwBV9byhJyX1jL/fZ5+/44YvyRHAXnRPYDwT2Bn4GvAs4IyqOnqE6S1I/o4dviTnVdVObftAur9nPgPsAXyuqt4xyvxmiwUMzYokP6+qrUedhxaGJJcAT6mq25JsA5wGfKyq3pPkwqp68mgz1EKQ5HvAZcCHgKIbXH0S2Aegqr4+uuykfvD3++zzd9zwtZ/59sD6wC+BLavq1iQbAOdW1RNHmd9C5O/Y4Rv8+yPJ+cCzq+r6JA8CzqmqJ4w2w9mxzqgT0PyR5OLJDgGbDTMXLXhrVdVtAFX10yS7AqcleQTdnzdpNiwBDgX+EXhTVV2U5LcOqrSm8ff70Pk7bvjuqap7gTuS/LiqbgWoqt8m+f2Ic1uo/B07fGsl2ZhumYhU1fUAVXV7kntGm9rssYChmdgM2BO4aVw8wHeGn44WsF8l2b6qLgJo31I9BzgBWBDVY41eVf0eOCbJqe39V/h7UWsmf78Pl7/jhu/uJA+sqjuAHceCSTYELGDMAX/HjsSGwAV0f3dXks2r6tq21s6CKY76h0gz8XngwWO/cAclOXvo2Wgh2w+4T6W4qu4B9kvygdGkpIWqqlYCL0nyV8Cto85HGgF/vw+Xv+OG7+lVdRf84R/WY9YF9h9NSmsGf8cOT1VtM8mh3wMvGGIqc8o1MCRJkiRJUu/5GFVJkiRJktR7FjAkSZIkSVLvWcCQNHRJZrQoXJJdk3x+rvKRJEnzm2MLac1gAUPS0FXVU0edgyRJWjgcW0hrBgsYkoYuyW3tfdckZyc5LckPk3w8SdqxpS32PeCFA+c+KMkJSc5LcmGSvVv8PUne0rb3TPKNJP4dJ0nSGsCxhbRm8DGqkkbtycDjgF8A3waelmQ58EFgN2AF8KmB9v8IfLWq/ibJRsB5Sb4CHA6cn+SbwLHAs8c9Kk2SJK0ZHFtIC5QVREmjdl5VrWwDgouAbYDHAj+pqiure9bzfw603wM4LMlFwNnAA4Ctq+oO4EDgTOA/qurHQ/sEkiSpTxxbSAuUMzAkjdpdA9v3suq/lwK8qKqumODYE4AbgIfPUm6SJGn+cWwhLVDOwJDURz8EtknyyLa/78CxM4C/G7if9cnt/RHAG+imje6VZOch5itJkvrNsYW0AFjAkNQ7VXUncBDwhbbQ1nUDh48C1gUuTnIpcFQbcHwYeGNV/QI4APhQkgcMOXVJktRDji2khSHdLWCSJEmSJEn95QwMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS
1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJElDleRVSb41sH9bkj9bxTnbJKkk68x9hpL6yAKGpDnnIEWSJE2lqh5cVVeNOo/Z4BhGmjv+TyVp6KrqwaPOQZIkaVSSrFNV94w6D2m+cQaGJEmSpDmTZKskn05yfZIbkvzHBG0qyaPa9gZJ3pXkZ0luSfKtJBtMcM6Lkvw0yeNX0f9fJPlOkpuTXJ3kVS2+YZKTWl4/S/JPSdZqx96a5D8HrnGfWRVJzk5yVJJvJ/lNki8n2bQ1/0Z7v7nNOn1Km4367STHJLkBODLJjUmeMNDHnyS5I8mimfx8pTWJBQxJs6oHg5TnJbm0DVLOTvLnA8fenOSaNtC4Isnus/GZJUnSxJKsDXwe+BmwDbAFcPIqTvt3YEfgqcAmwN8Dvx933VcD7wSeWVU/mKL/RwBfBN4LLAK2By5qh98LbAj8GfAMYD/g1dP8aAAvb+3/BFgPeGOLP729b9Rujflu298ZuArYDDiK7ufw1wPX2xc4q6qun0EO0hrFAoakWdODQcqjgU8Cr6MbpJwOfC7JekkeAxwC/LeqegiwJ/DTGX1ASVoNSU5Icl2SSf/+Gtf+pUkua8XYT8x1ftIc2wl4OPCmqrq9qu6sqm9N1rjNgPgb4NCquqaq7q2q71TVXQPNXge8Cdi1qlasov+XA1+pqk9W1e+q6oaquqiNWfYBDq+q31TVT4F3Aa+cwWf7SFX9qKp+C5xCVxyZyi+q6r1VdU8750Rg3yRpx18JfGwG/UtrHAsYkmbTqAcpLwO+UFVnVtXv6IojG9AVR+4F1ge2S7JuVf20qn68uh9Ukmbgo8DS6TRMshg4HHhaVT2O7u9AaT7bCvjZDNZ72BR4ADDV7+g3Ae+rqpXT7H+ia20KrEv3pcuYn9F9+TJdvxzYvgNY1RpfVw/uVNW57bxdkzwWeBSwbAb9S2scCxiSZtOoBykPZ2AgUlW/pxssbNGKH68D3gpcl+TkJA+fZp6StNqq6hvAjYOxJI9M8qUkFyT5ZvvHC8CBdH/n3dTOvW7I6Uqz7Wpg6xk8kePXwJ3AI6doswfwT0leNM3+J7rWr4HfAY8YiG0NXNO2bwceOHDsT6fR15iaQfxEuttIXgmcVlV3zqAfaY1jAUPS
bBr1IOUXDAxE2pTMrWiDkar6RFX9RWtTdLelSNIoHA/8XVXtSHff/Ptb/NHAo9tif+ckmdbMDanHzgOuBd6R5EFJHpDkaZM1bl8+nAC8O8nDk6zdFsFcf6DZpXSzmt6X5Hmr6P/jwDPbrVnrJHlYku2r6l662z6OTvKQtlbG64GxhTsvAp6eZOskG9LNjJqu6+luh53ykfHNfwIvoCtinDSDPqQ1kgUMSbNp1IOUU4C/SrJ7knWBNwB3Ad9J8pgku7Vr3wn8lnFrbUjSMCR5MN2tbacmuQj4ALB5O7wOsBjYlW5Bvw8m2Wj4WUqzoxUKnkt3e8TPgZV0t3xO5Y3AJcD5dLOX3sm4f7dU1feB59D9P7LXFP3/HHg23ZjgRrrCxJPa4b+jm2lxFfAt4BN04xKq6kzgU8DFwAV0a3xNS1XdARwNfLstKr7LFG2vBr5H98XKN6fbh7SmStVkM5wkaeaSbA0cC/x3ul/Gn6D7xfy3bfYDSQpYXFUr2hNH/gV4Cd29o9+nW2BzM+AnwLpVdU+SJcAXgFdV1Ren6P8FdIOGLegGKa+pqkuTPBH4EPDndFNGvwMcVFW/mOUfgST9kSTbAJ+vqscneShwRVVtPkG7/wOcW1UfaftnAYdV1flDTVjS0CQ5gW6Bz38adS5S31nAkCRJmmODBYy2/x3gmKo6td3u9sSq+n67ZWTfqto/yabAhcD2VXXDyJKXNGfa3w0XAU+uqp+MNhup/7yFRJIkaQ4l+STwXeAxSVYmOQB4BXBAku/T3Sq3d2t+BnBDksuAr9E91cnihTSFJK9IctsEr0tHndtUkhwF/AD4N4sX0vQ4A0PSvJLkFXT3i4/3s/bIQUmSJEkLkAUMSZIkSZLUe9N91OEaYdNNN61tttlm1GlIkjSvXHDBBb+uqkWjzqNvHFdIkrR6JhtbWMAYsM0227B8+fJRpyFJ0ryS5GejzqGPHFdIkrR6JhtbuIinJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfd8jKokzQNH//WLR53CavnH/zxt1ClIvbLjm04adQp/5IJ/22/UKUiSNC3OwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZK0xktyQpLrkvxgkuNJcmySFUkuTrLDsHOUJGlNZwFDkiQJPgosneL4XsDi9joIOG4IOUmSpAFzXsBIslGS05L8MMnlSZ6SZJMkZya5sr1v3NpO+u1Gkv1b+yuT7D8Q3zHJJe2cY5OkxSfsQ5Ikabyq+gZw4xRN9gZOqs45wEZJNh9OdpIkCYYzA+M9wJeq6rHAk4DLgcOAs6pqMXBW24dJvt1IsglwBLAzsBNwxEBB4jjgwIHzxr49mawPSZKkmdoCuHpgf2WLSZKkIZnTAkaSDYGnAx8GqKq7q+pmum8xTmzNTgSe37Yn+3ZjT+DMqrqxqm4CzgSWtmMPrapzqqqAk8Zda6I+JEmS5kSSg5IsT7L8+uuvH3U6kiQtKHM9A2Nb4HrgI0kuTPKhJA8CNquqa1ubXwKbte3Jvt2YKr5ygjhT9HEfDjQkSdI0XANsNbC/ZYvdR1UdX1VLqmrJokWLhpacJElrgrkuYKwD7AAcV1VPBm5n3K0cbeZEzWUSU/XhQEOSJE3DMmC/tl7XLsAtA1+USJKkIZjrAsZKYGVVndv2T6MraPxqbOGr9n5dOz7ZtxtTxbecIM4UfUiSJN1Hkk8C3wUek2RlkgOS/M8k/7M1OR24ClgBfBB4zYhSlSRpjbXOXF68qn6Z5Ookj6mqK4Ddgcvaa3/gHe39s+2UZcAhSU6mW7Dzlqq6NskZwNsHFu7cAzi8qm5Mcmv7JuRcYD/gvQPXmqgPSZKk+6iqfVdxvICDh5SOJEmawJwWMJq/Az6eZD26by5eTTfz45QkBwA/A17a2p4OPJvu2407Wlta
oeIo4PzW7siqGnvU2Wvont2+AfDF9oKucDFRH5IkSZIkaZ6Z8wJGVV0ELJng0O4TtJ30242qOgE4YYL4cuDxE8RvmKgPSZIkSZI0/8z1GhiSJEmSJEn3mwUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvTfnBYwkP01ySZKLkixvsU2SnJnkyva+cYsnybFJViS5OMkOA9fZv7W/Msn+A/Ed2/VXtHMzVR+SJEmSJGn+GdYMjL+squ2raknbPww4q6oWA2e1fYC9gMXtdRBwHHTFCOAIYGdgJ+CIgYLEccCBA+ctXUUfkiRJkiRpnhnVLSR7Aye27ROB5w/ET6rOOcBGSTYH9gTOrKobq+om4ExgaTv20Ko6p6oKOGnctSbqQ5IkSZIkzTPDKGAU8OUkFyQ5qMU2q6pr2/Yvgc3a9hbA1QPnrmyxqeIrJ4hP1cd9JDkoyfIky6+//voZfzhJkiRJkjT3hlHA+Iuq2oHu9pCDkzx98GCbOVFzmcBUfVTV8VW1pKqWLFq0aC7TkCRJPZVkaZIr2ppaf3TbaZKtk3wtyYVtna5njyJPSZLWZHNewKiqa9r7dcBn6Naw+FW7/YP2fl1rfg2w1cDpW7bYVPEtJ4gzRR+SJEl/kGRt4H10X7ZsB+ybZLtxzf4JOKWqngzsA7x/uFlKkqQ5LWAkeVCSh4xtA3sAPwCWAWNPEtkf+GzbXgbs155GsgtwS7sN5AxgjyQbt8U79wDOaMduTbJLe/rIfuOuNVEfkiRJg3YCVlTVVVV1N3Ay3Vpagwp4aNveEPjFEPOTJEnAOnN8/c2Az7Qnm64DfKKqvpTkfOCUJAcAPwNe2tqfDjwbWAHcAbwaoKpuTHIUcH5rd2RV3di2XwN8FNgA+GJ7Abxjkj4kSZIGTbTW1s7j2ryVbk2vvwMeBDxzogu19b4OAth6661nPVFJktZkc1rAqKqrgCdNEL8B2H2CeAEHT3KtE4ATJogvBx4/3T4kSZJWw77AR6vqXUmeAnwsyeOr6veDjarqeOB4gCVLlszpGl+SJK1pRvUYVUmSpL6YbK2tQQcApwBU1XeBBwCbDiU7SZIEWMCQJEk6H1icZNsk69Et0rlsXJuf02Z2JvlzugKGz1+XJGmILGBIkqQ1WlXdAxxCt2j45XRPG7k0yZFJnteavQE4MMn3gU8Cr2q3vkqSpCGZ60U8JUmShibJE6rqkpmeV1Wn0y0mPhh7y8D2ZcDT7n+GkiRpdTkDQ5IkLSTvT3Jektck2XDUyUiSpNljAUOSJC0YVfXfgVfQLcp5QZJPJHnWiNOSJEmzwAKGJElaUKrqSuCfgDcDzwCOTfLDJC8cbWaSJOn+sIAhSZIWjCRPTHIM3WKcuwHPrao/b9vHjDQ5SZJ0v7iIpyRJWkjeC3wI+Ieq+u1YsKp+keSfRpeWJEm6vyxgSJKkBSHJ2sA1VfWxiY5PFpckSfODt5BIkqQFoaruBbZKst6oc5EkSbPPGRiSJGkh+Qnw7STLgNvHglX17tGlJEmSZoMFDEmStJD8uL3WAh7SYjW6dCRJ0myxgCFJkhaSy6rq1MFAkpeMKhlJkjR7XANDkiQtJIdPMyZJkuYZZ2BIkqR5L8lewLOBLZIcO3DoocA9o8lKkiTNpqHMwEiydpILk3y+7W+b5NwkK5J8amy18CTrt/0V7fg2A9c4vMWvSLLnQHxpi61IcthAfMI+JEnSgvQLYDlwJ3DBwGsZsOcU50mSpHlitQoYSTZO8sQZnHIocPnA/juBY6rqUcBNwAEtfgBwU4sf09qRZDtgH+BxwFLg/a0osjbwPmAvYDtg39Z2qj4kSdICU1Xfr6oTgUdV1YkDr09X1U2j
zk+SJN1/0y5gJDk7yUOTbAJ8D/hgklU+kizJlsBfAR9q+wF2A05rTU4Ent+29277tOO7t/Z7AydX1V1V9RNgBbBTe62oqquq6m7gZGDvVfQhSZIWrp2SnJnkR0muSvKTJFeNOilJknT/zWQNjA2r6tYkfwucVFVHJLl4Guf9b+Dv+X+PMnsYcHNVjd2PuhLYom1vAVwNUFX3JLmltd8COGfgmoPnXD0uvvMq+riPJAcBBwFsvfXW0/g4kiSpxz4M/H90t4/cO+JcJEnSLJrJLSTrJNkceCnw+emckOQ5wHVVdcHqJDcMVXV8VS2pqiWLFi0adTqSJOn+uaWqvlhV11XVDWOvUSclSZLuv5nMwDgSOAP4dlWdn+TPgCtXcc7TgOcleTbwALqVwN8DbJRknTZDYkvgmtb+GmArYGWSdYANgRsG4mMGz5kofsMUfUiSpIXra0n+Dfg0cNdYsKq+N7qUJEnSbJh2AaOqTgVOHdi/CnjRKs45nPbs9SS7Am+sqlckORV4Md2aFfsDn22nLGv7323Hv1pVlWQZ8Im25sbDgcXAeUCAxUm2pStQ7AO8vJ3ztUn6kCRJC9fO7X3JQKzo1saSJEnz2LQLGG0xzvfSzaoA+CZwaFWtXI1+3wycnORtwIV096vS3j+WZAVwI11Bgqq6NMkpwGV0z3I/uKrubXkdQjczZG3ghKq6dBV9SJKkBaqq/nLUOUiSpLkxk1tIPgJ8AnhJ2//rFnvWdE6uqrOBs9v2VXRPEBnf5s6B648/djRw9ATx04HTJ4hP2IckSVq4krxlonhVHTnsXCRJ0uyaySKei6rqI1V1T3t9FHDVS0mS1Ce3D7zuBfYCtlnVSUmWJrkiyYokh03S5qVJLktyaZJPzGbSkiRp1WYyA+OGJH8NfLLt70u3WKYkSVIvVNW7BveT/DvdraaTSrI28D66WaUrgfOTLKuqywbaLKZb1+tpVXVTkj+Z9eQlSdKUZjID42/oHqH6S+BaugUyXz0XSUmSJM2SB9I9jWwqOwErquqqqrqbbgHwvce1ORB4X1XdBFBV1816ppIkaUozeQrJz4DnTXY8yeFV9S+zkpUkSdJqSHIJ3VNHoFvgexHdo+CnsgVw9cD+Sv7f00zGPLpd/9vtum+tqi9N0P9BwEEAW2+99UzTlyRJU5jJLSSr8hLAAoYkSRql5wxs3wP8qqrumYXrrkP3GPdd6WZ0fCPJE6rq5sFGVXU8cDzAkiVLCkmSNGtmcgvJqmQWryVJkjRjbcboRsBzgRcA203jtGuArQb2t2yxQSuBZVX1u6r6CfAjuoKGJEkaktksYPgtgyRJGqkkhwIfB/6kvT6e5O9Wcdr5wOIk2yZZD9gHWDauzf+lm31Bkk3pbim5avYylyRJqzKbt5A4A0OSJI3aAcDOVXU7QJJ3At8F3jvZCVV1T5JD6J5WsjZwQlVdmuRIYHlVLWvH9khyGd3jWd9UVT6NTZKkIZp2ASPJJlV147jYtm0aJcCps5qZJEnSzIWuwDDmXqbxJUtVnQ6cPi72loHtAl7fXpIkaQRmMgPjc0n2qqpbAZJsB5wCPB6gqt4+B/lJkiTNxEeAc5N8pu0/H/jw6NKRJEmzZSYFjLfTFTH+CngMcBLwijnJSpIkaTVU1buTnA38RQu9uqouHGFKkiRplky7gFFVX0iyLvBl4CHAC6rqR3OWmSRJ0gwl2QW4tKq+1/YfmmTnqjp3xKlJkqT7aZUFjCTv5b5PGNkQ+DFwSBKq6rVzlZwkSdIMHQfsMLB/2wQxSZI0D01nBsbycfsXzEUikiRJsyBtwU0Aqur3SWbzqWuSJGlEVvkLvapOBEjyIODOqrq37a8NrD+36UmSJM3IVUleSzfrAuA1wFUjzEeSJM2StWbQ9ixgg4H9DYCvzG46kiRJ98v/BJ4KXAOsBHYGDhppRpIkaVbMZErlA6rqtrGdqrotyQOnOiHJA4Bv0M3UWAc4raqOSLItcDLwMLpbUl5ZVXcn
WZ/u6SY7AjcAL6uqn7ZrHQ4cQPc899dW1RktvhR4D7A28KGqekeLT9jHDD7vH+z4ppNW57SRu+Df9ht1CpIkDVVVXQfsM9nxJIdX1b8MMSVJkjRLZjID4/Ykf1gAK8mOwG9Xcc5dwG5V9SRge2BpWx38ncAxVfUo4Ca6wgTt/aYWP6a1I8l2dIORxwFLgfcnWbvdxvI+YC9gO2Df1pYp+pAkSWuul4w6AUmStHpmUsB4HXBqkm8m+RbwKeCQqU6oztisjXXbq4DdgNNa/ETg+W1777ZPO757krT4yVV1V1X9BFgB7NReK6rqqja74mRg73bOZH1IkqQ1V0adgCRJWj3TvoWkqs5P8ljgMS10RVX9blXntVkSFwCPopst8WPg5qq6pzVZCWzRtrcArm793ZPkFrpbQLYAzhm47OA5V4+L79zOmayP8fkdRLs3duutt17Vx5EkSfNbrbqJJEnqo1UWMJLsVlVfTfLCcYcenYSq+vRU57enlmyfZCPgM8BjVzvbOVBVxwPHAyxZssRBjSRJC5szMCRJmqemMwPjGcBXgedOcKyAKQsYf2hYdXOSrwFPATZKsk6bIbEl3UrhtPetgJXtme0b0i3mORYfM3jORPEbpuhDkiQtUEk2qaobx8W2bbegApw6grQkSdIsWOUaGFV1RHt/9QSvv5nq3CSL2swLkmwAPAu4HPga8OLWbH/gs217WdunHf9qVVWL75Nk/fZ0kcXAecD5wOIk2yZZj26hz2XtnMn6kCRJC9fnkjx0bKct7v25sf2qevtIspIkSffbdG4hef1Ux6vq3VMc3hw4sa2DsRZwSlV9PsllwMlJ3gZcCHy4tf8w8LEkK4AbaY9Bq6pLk5wCXAbcAxzcbk0hySHAGXSPUT2hqi5t13rzJH1IkqSF6+10RYy/olu36yTgFaNNSZIkzYbp3ELykCmOTblmRFVdDDx5gvhVdE8QGR+/k0keb1ZVRwNHTxA/HTh9un1IkqSFq6q+kGRd4Mt0Y5gXVNWPRpyWJEmaBassYFTVPwMkORE4tKpubvsbA++a0+wkSZKmIcl7ue8XKxvSPfnskLbo+GtHk5kkSZot036MKvDEseIFQFXdlOSPZldIkiSNwPJx+xeMJAtJkjRnZlLAWCvJxlV1E3SrfM/wfEmSpDlRVScCJHkQcOfAWllrA+uPMjdJkjQ7ZlKAeBfw3SRjjx97CROsSSFJkjRCZwHPBG5r+xvQrYfx1JFlJEmSZsW0CxhVdVKS5cBuLfTCqrpsbtKSJElaLQ+oqrHiBVV1W5IHjjIhSZI0O9aaSeOquqyq/qO9LF5IkqS+uT3JDmM7SXYEfruqk5IsTXJFkhVJDpui3YuSVJIls5SvJEmaJtewkCRJC8nrgFOT/AII8KfAy6Y6oa2T8T7gWcBK4Pwky8Z/WZPkIcChwLlzkLckSVoFCxiSJGnBqKrzkzwWeEwLXVFVv1vFaTsBK6rqKoAkJwN7A+Nnmx4FvBN40yymLEmSpmlGt5BIkiT1UZLd2vsLgecCj26v57bYVLYArh7YX9lig9ffAdiqqr6wijwOSrI8yfLrr79+hp9CkiRNxRkYkiRpIXgG8FW64sV4BXx6dS+cZC3g3cCrVtW2qo4HjgdYsmRJrW6fkiTpj1nAkCRJ815VHdHeX70ap18DbDWwv2WLjXkI8Hjg7CTQrauxLMnzqmr56mUsSZJmygKGJEma95K8fqrjVfXuKQ6fDyxOsi1d4WIf4OUD594CbDrQ19nAGy1eSJI0XBYwJEnSQvCQKY5NeStHVd2T5BDgDGBt4ISqujTJkcDyqlo2i3lKkqTVZAFDkiTNe1X1zwBJTgQOraqb2/7GwLumcf7pwOnjYm+ZpO2u9zNdSZK0GnwKiSRJWkieOFa8AKiqm4Anjy4dSZI0W+a0gJFkqyRfS3JZkkuTHNrimyQ5M8mV7X3jFk+SY5OsSHJxe2TZ2LX2b+2vTLL/QHzHJJe0c45NW11rsj4kSdKC
9mvbHwicOZ/fSZIk3Zc9MCRJ0kKwG/Aa4JIkF7XYW4BXJdkRKOA7wB8BVNVlSU4DLqebweTQqroHIMlhwDJgA2BJVV3W9vdXwKlJ/gb4Ol3BRJIkjYkFDEmSNPGq6ktAZlh11hzbHAUcNUP8rJm2q6qr6WYpkSRJPfAREkmSJEmSNHgWMCRJkiRJ0uBZwJAkSZIkSYNnAUOSJEmSJA2eBQxJkiRJkjR4FjAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNngUMSZIkSZI0eBYwJEmSJEnS4FnAkCRJkiRJg2cBQ5IkSZIkDZ4FDEmSJEmSNHgWMCRJkiRJ0uBZwJAkSZIkSYNnAUOSJEmSJA3egi5gJFmc5MokK5Mc3nc+kiRpsnltIUlSfxZsASPJBsAHgL2BHYBXJdmh36wkSdKk8tpCkqR+LdgCBrAzsLKqrq6qnwCnAvv2nJMkSZpcXltIktSjVFXfOcyLJPsBi6vqD9vn1wC7VNVh09odAhzSPv4KcOU8pbQZ8D/ztO/5NKl5w+TmPql5w+TmPql5w+Tmbt7jN5+5/1JVbT5P+x6MNbm2GON1BUzuv0fzHr9JzX1S84bJzX1S84bJzd28ZzbjtcWG83jAiVBVJwAnzPdxkqyoqkXzfZx1bVLzhsnNfVLzhsnNfVLzhsnN3bzHb5JznyTjuq6Ayf1vat7jN6m5T2reMLm5T2reMLm5m/faWciPkFwHbD3yeasWkyRJeiC8tpAkqUcLuYCxHNg+yXZJHgbsDyztOSdJkjS5vLaQJKlHC/YRkqq6O8lhwDJgA2BJVV3WY0pj6U46DyY1b5jc3Cc1b5jc3Cc1b5jc3M17/CY590Hw2mKdMe/xm9TcJzVvmNzcJzVvmNzczXstLNhBPCVJkiRJ0sKxkB8hkSRJkiRJC4QFDEmSJEmSNHgWMCRJkiRJ0uBZwNB9JPnVJHskefS0+OK+clpTSXZO8utteYckf5Fkn77zWltJTuk7hwciyW+2n/mefecylyS7JHlsW94oyduTfCrJMUke13d+s0nyp0m2Xn3L4UnysCQHJHlB+/zqJO9PcmiSh/ad31ySPCnJXyZ5b5J3J/njqX8/0jh5fu6f5+f5NannZ5jcc7TnZ00iB/EcsyQHVdU/9Z3HTJL8KXAocAWwI/CGqjqzrbuwqp7dY3pzSnIEsDfdzDpnA7sAnwdeCCyrqqN6TG9WSaZPvxfgecDnAKrqJWNPag0lOb+qdm7Lr6P7t/NJYE/gU1V1dJ/5zSbJZcAz22wCJwB3AWcAe7T4y3tNcBZJbgd+AHwL+BhwelXd3G9WaybJR+j+33wk8D3g0cAn6H7mqaoD+8tudu134ouAc4F9gK/T5f8y4PVV9YXektN6xfPz+Hl+Hr9JPT/D5J6jPT9rElnAGLMk362qbfrOYyZJLgH+V1XdmWRbupPGP1fVe5N8vaqe1W+Gs2u57wg8HLgB2Kqq7kiyEfC1qnpGn/nNJsmFwOXAh4Ciu0D6GLA/QFV9sb/s5jb6byLJcmCfqro5yaOA86rq6f1mOLMkV1TVU9vyfS78k1xUVTv2ltwcknwd2Al4AfC7wEuAC+j+vXyiqr7fY3pzSnJxVT0jyYbAdcATquqeJAG+MeD/Py8Bdmy5PhI4q6p2T7INcObAfyc+Dngz8FLgF+h+v9wEnAkcXVXf6y05rTXPz+Pn+Xn8JvX8DJN7jvb8PH6enx88HyGZB0kunuV1CbBF3/nN4SFVdSdAVX0H2B3YO8m76U7cQ3Z3Vd1TVXcB36qqOwCq6ofAvf2mNqdFdCe4twK3t4rxD6vqi0O+OGoekmSTJI+nK4beDFBVPwDu7je1OV2a5KC2/I0kiwCSPAX4aX9prVZV1b1V9dmqOhh4AvAPwGLg6n5TW62HJHkY8Bi6uzxTXYEfDgy6iyrdnSnocn00QFV9l+HnfRpwG7B7VW1aVY+nu3t8W1unyeL5efw8P4/fpJ6fYXLP0Z6fx29Bnp+TfGZc
x9pw9U30AGwB7EX3D3FUgK+MP501dmOSHavqIoB2p+dFwBJgkNX6ET9J8sh2gbTTVLBVOQd7gVRV9wLvSXJ6e7+Ryfn/8nF0F3cBKsmWVXV9uuezh3xB/YfAe5O8Dfgf4KtJrgWubeuG6j4/06r6KbAUWNruPgzZicA3gQ3o/hg4PcnVwK7AqX0mthofApYn+RrwW8AxAEk2B27tM7E1sG1VHTMaqKobgGOS/EFPOemB8/w8Zp6fezGp52eY3HO05+fxm9jzc5LZHlcMXU+78eThIyTrXpITgX+qqi/NsO6jVfXqHtJarSRb0d0puWGGdbtV1Zd7SGuNJHl4Vf14hvhmwJZVdUkPaa21JL8N7FZVb+k7lweqnai3qKpv953LXNIN9LQd3QXpqqq6seeU5pTkKVX1X33n8UAleQJAVf13ko3putl+t6rO7zWx1Ujya8BTgUur6pt957OmknwW+A/g5Kl/20m2AF4LvLCqXtBjelpLnp/75/l5fCbt/AyTfY72/Dxek3x+TnIP8EVmLoTuWlUbjSUPCxiSJC0sSTYBDgf2pXvGFuBGujuCR1fV9B6CkiRpnk3y+TnJpcDLquqqGdZdW1VjmYnHAoYkSeuRDHg2LEmS1ldDPz8n2Q+4pKqunGHdS6vqX8eShwUMSZLWHxnwbFiSJK2vJvn8PM7iiwUMSZIWmCQXz7YKeEpVPXyc+UiSpIV7fh5n8WVSRlOWJElrblJnw5IkaSGb2PPzaoovW4wrDwsYkiQtPJ8GHj017eaoJF8YezaSJAkm+/w8iOKLj5BIkiRJkqRZJTkR+Keq+tIM6z5aVa8eSx4WMCRJkiRJ0tA9pO8EJEmSJEmSVscChqSxS7JWz8kl2T3Jp+crH0mSJEnDZwFD0thV1W/0nYMkSVo4vDkirR8sYEgauyR3tvfdk3whyRlJvpnkI0nS1i1usQuBl49s+6gkS5Kcn+TrSfZt8fcm+eu2vFeSc5P4O06SpPWAN0ek9YMX95L69izgz4AdgCcBuyV5BPBB4MXATsAvjrR/K/C5qtoZeB7wriSPAt4M/G6S5wHHAQdV1b1j+xaSJKk33hyR1g/+Dyipb+dX1apWbLgI2Bb4VeDbVXVVdVMlfXik/Z7A4UkuAr4APALYpqruAl4HnA28v6q+NbZvIEmShsSbI9ICtWHfCUha7/14ZPkeVv97KcDvVNWVM6x7OnAL8IR1lJskSZo851fVKoB2w2Nb4E7azZEW/zBwSGu/J/CSJH/ZPk/dHLkiyeuAc4E/9+aI1D97YEgaom8C2yZ5cvv8qpF1y4A/GekO+qz2/kvAG+nuuuydZJcx5itJkobjgd4c2bG9tqmqK9o6b45IA2IBQ9LgVNWP6O6K/Ft7TvWmkdXvBB4KXJzkMuCdrZhxIvCXVfXfwMHAh1p3UUmSJG+OSAtAusfLJUmSJGkyJbmzqh6dZHe6GxovavH3Ayuq6qQki4H/D7gL+E/gyVX1oiQbtfhv0N3g/TbdWBlnA8dV1dIkOwEnAb/ebrRI6oEFDEmSJEmSNHg+QiJJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8P5/amPPP886sqQAAAAASUVORK5CYII=\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure()\n",
- "plt.figure(figsize=(15, 20))\n",
- "i = 1\n",
- "for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', \n",
- " 'click_region', 'click_referrer_type', 'rank', 'click_cnts']:\n",
- " plot_envs = plt.subplot(5, 2, i)\n",
- " i += 1\n",
- " v = trn_click[col].value_counts().reset_index()[:10]\n",
- " fig = sns.barplot(x=v['index'], y=v[col])\n",
- " for item in fig.get_xticklabels():\n",
- " item.set_rotation(90)\n",
- " plt.title(col)\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
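- "Note: the click_cnts histogram here shows the cumulative click counts of the users associated with each article.\n",
- "\n",
- "The same analysis can also be done from the user's perspective, by plotting a histogram of how many articles each user clicked."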
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "4 1084627\n",
- "2 25894\n",
- "1 2102\n",
- "Name: click_environment, dtype: int64"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click['click_environment'].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Looking at click_environment, only 2,102 clicks (0.19%) come from environment 1 and only 25,894 clicks (2.3%) from environment 2; the remaining 97.5% come from environment 4."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "1 678187\n",
- "3 395558\n",
- "4 38731\n",
- "5 141\n",
- "2 6\n",
- "Name: click_deviceGroup, dtype: int64"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click['click_deviceGroup'].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Looking at click_deviceGroup, device group 1 accounts for the majority of clicks (61%) and device group 3 for 36%."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Test set user click logs"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 249999 \n",
- " 160974 \n",
- " 1506959142820 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 19 \n",
- " 19 \n",
- " 281 \n",
- " 1506912747000 \n",
- " 259 \n",
- " \n",
- " \n",
- " 1 \n",
- " 249999 \n",
- " 160417 \n",
- " 1506959172820 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 18 \n",
- " 19 \n",
- " 281 \n",
- " 1506942089000 \n",
- " 173 \n",
- " \n",
- " \n",
- " 2 \n",
- " 249998 \n",
- " 160974 \n",
- " 1506959056066 \n",
- " 4 \n",
- " 1 \n",
- " 12 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 5 \n",
- " 5 \n",
- " 281 \n",
- " 1506912747000 \n",
- " 259 \n",
- " \n",
- " \n",
- " 3 \n",
- " 249998 \n",
- " 202557 \n",
- " 1506959086066 \n",
- " 4 \n",
- " 1 \n",
- " 12 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 4 \n",
- " 5 \n",
- " 327 \n",
- " 1506938401000 \n",
- " 219 \n",
- " \n",
- " \n",
- " 4 \n",
- " 249997 \n",
- " 183665 \n",
- " 1506959088613 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 15 \n",
- " 5 \n",
- " 7 \n",
- " 7 \n",
- " 301 \n",
- " 1500895686000 \n",
- " 256 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "200000"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "0 249999 160974 1506959142820 4 \n",
- "1 249999 160417 1506959172820 4 \n",
- "2 249998 160974 1506959056066 4 \n",
- "3 249998 202557 1506959086066 4 \n",
- "4 249997 183665 1506959088613 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "0 1 17 1 13 \n",
- "1 1 17 1 13 \n",
- "2 1 12 1 13 \n",
- "3 1 12 1 13 \n",
- "4 1 17 1 15 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "0 2 19 19 281 1506912747000 \n",
- "1 2 18 19 281 1506942089000 \n",
- "2 2 5 5 281 1506912747000 \n",
- "3 2 4 5 327 1506938401000 \n",
- "4 5 7 7 301 1500895686000 \n",
- "\n",
- " words_count \n",
- "0 259 \n",
- "1 173 \n",
- "2 259 \n",
- "3 219 \n",
- "4 256 "
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])\n",
- "tst_click.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 5.180100e+05 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 5.180100e+05 \n",
- " 518010.000000 \n",
- " \n",
- " \n",
- " mean \n",
- " 227342.428169 \n",
- " 193803.792550 \n",
- " 1.507387e+12 \n",
- " 3.947300 \n",
- " 1.738285 \n",
- " 13.628467 \n",
- " 1.348209 \n",
- " 18.250250 \n",
- " 1.819614 \n",
- " 15.521785 \n",
- " 30.043586 \n",
- " 305.324961 \n",
- " 1.506883e+12 \n",
- " 210.966331 \n",
- " \n",
- " \n",
- " std \n",
- " 14613.907188 \n",
- " 88279.388177 \n",
- " 3.706127e+08 \n",
- " 0.323916 \n",
- " 1.020858 \n",
- " 6.625564 \n",
- " 1.703524 \n",
- " 7.060798 \n",
- " 1.082657 \n",
- " 33.957702 \n",
- " 56.868021 \n",
- " 110.411513 \n",
- " 5.816668e+09 \n",
- " 83.040065 \n",
- " \n",
- " \n",
- " min \n",
- " 200000.000000 \n",
- " 137.000000 \n",
- " 1.506959e+12 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 2.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.265812e+12 \n",
- " 0.000000 \n",
- " \n",
- " \n",
- " 25% \n",
- " 214926.000000 \n",
- " 128551.000000 \n",
- " 1.507026e+12 \n",
- " 4.000000 \n",
- " 1.000000 \n",
- " 12.000000 \n",
- " 1.000000 \n",
- " 13.000000 \n",
- " 1.000000 \n",
- " 4.000000 \n",
- " 10.000000 \n",
- " 252.000000 \n",
- " 1.506970e+12 \n",
- " 176.000000 \n",
- " \n",
- " \n",
- " 50% \n",
- " 229109.000000 \n",
- " 199197.000000 \n",
- " 1.507308e+12 \n",
- " 4.000000 \n",
- " 1.000000 \n",
- " 17.000000 \n",
- " 1.000000 \n",
- " 21.000000 \n",
- " 2.000000 \n",
- " 8.000000 \n",
- " 19.000000 \n",
- " 323.000000 \n",
- " 1.507249e+12 \n",
- " 199.000000 \n",
- " \n",
- " \n",
- " 75% \n",
- " 240182.000000 \n",
- " 272143.000000 \n",
- " 1.507666e+12 \n",
- " 4.000000 \n",
- " 3.000000 \n",
- " 17.000000 \n",
- " 1.000000 \n",
- " 25.000000 \n",
- " 2.000000 \n",
- " 18.000000 \n",
- " 35.000000 \n",
- " 399.000000 \n",
- " 1.507630e+12 \n",
- " 232.000000 \n",
- " \n",
- " \n",
- " max \n",
- " 249999.000000 \n",
- " 364043.000000 \n",
- " 1.508832e+12 \n",
- " 4.000000 \n",
- " 5.000000 \n",
- " 20.000000 \n",
- " 11.000000 \n",
- " 28.000000 \n",
- " 7.000000 \n",
- " 938.000000 \n",
- " 938.000000 \n",
- " 460.000000 \n",
- " 1.509949e+12 \n",
- " 3082.000000 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "# The training set contains 200,000 users\n",
+ "trn_click.user_id.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T16:03:01.378461Z",
+ "start_time": "2020-11-13T16:03:01.300712Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "count 518010.000000 518010.000000 5.180100e+05 518010.000000 \n",
- "mean 227342.428169 193803.792550 1.507387e+12 3.947300 \n",
- "std 14613.907188 88279.388177 3.706127e+08 0.323916 \n",
- "min 200000.000000 137.000000 1.506959e+12 1.000000 \n",
- "25% 214926.000000 128551.000000 1.507026e+12 4.000000 \n",
- "50% 229109.000000 199197.000000 1.507308e+12 4.000000 \n",
- "75% 240182.000000 272143.000000 1.507666e+12 4.000000 \n",
- "max 249999.000000 364043.000000 1.508832e+12 4.000000 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
- "mean 1.738285 13.628467 1.348209 18.250250 \n",
- "std 1.020858 6.625564 1.703524 7.060798 \n",
- "min 1.000000 2.000000 1.000000 1.000000 \n",
- "25% 1.000000 12.000000 1.000000 13.000000 \n",
- "50% 1.000000 17.000000 1.000000 21.000000 \n",
- "75% 3.000000 17.000000 1.000000 25.000000 \n",
- "max 5.000000 20.000000 11.000000 28.000000 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id \\\n",
- "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
- "mean 1.819614 15.521785 30.043586 305.324961 \n",
- "std 1.082657 33.957702 56.868021 110.411513 \n",
- "min 1.000000 1.000000 1.000000 1.000000 \n",
- "25% 1.000000 4.000000 10.000000 252.000000 \n",
- "50% 2.000000 8.000000 19.000000 323.000000 \n",
- "75% 2.000000 18.000000 35.000000 399.000000 \n",
- "max 7.000000 938.000000 938.000000 460.000000 \n",
- "\n",
- " created_at_ts words_count \n",
- "count 5.180100e+05 518010.000000 \n",
- "mean 1.506883e+12 210.966331 \n",
- "std 5.816668e+09 83.040065 \n",
- "min 1.265812e+12 0.000000 \n",
- "25% 1.506970e+12 176.000000 \n",
- "50% 1.507249e+12 199.000000 \n",
- "75% 1.507630e+12 232.000000 \n",
- "max 1.509949e+12 3082.000000 "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
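- "We can see that the training and test sets contain completely different users.\n",
- "\n",
- "User IDs in the training set range from 0 to 199999, while user IDs in test set A range from 200000 to 249999.\n",
- "\n",
- "As a result, training needs to include the test set data as well; the combined data is referred to as the full dataset.\n",
- "\n",
- "**The training and test sets are merged for the rest of the analysis.**"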
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "50000"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# The test set contains 50,000 users\n",
- "tst_click.user_id.nunique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:56:07.717463Z",
- "start_time": "2020-11-13T15:56:07.693494Z"
- }
- },
- "outputs": [
+ "source": [
+ "trn_click.groupby('user_id')['click_article_id'].count().min() # every user in the training set clicked at least two articles"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "1"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Plot histograms for a rough look at the basic attribute distributions"
]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click.groupby('user_id')['click_article_id'].count().min() # note: some users in the test set clicked only one article"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### News article information table"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:20:34.183761Z",
- "start_time": "2020-11-13T15:20:34.164770Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " click_article_id \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " 0 \n",
- " 1513144419000 \n",
- " 168 \n",
- " \n",
- " \n",
- " 1 \n",
- " 1 \n",
- " 1 \n",
- " 1405341936000 \n",
- " 189 \n",
- " \n",
- " \n",
- " 2 \n",
- " 2 \n",
- " 1 \n",
- " 1408667706000 \n",
- " 250 \n",
- " \n",
- " \n",
- " 3 \n",
- " 3 \n",
- " 1 \n",
- " 1408468313000 \n",
- " 230 \n",
- " \n",
- " \n",
- " 4 \n",
- " 4 \n",
- " 1 \n",
- " 1407071171000 \n",
- " 162 \n",
- " \n",
- " \n",
- " 364042 \n",
- " 364042 \n",
- " 460 \n",
- " 1434034118000 \n",
- " 144 \n",
- " \n",
- " \n",
- " 364043 \n",
- " 364043 \n",
- " 460 \n",
- " 1434148472000 \n",
- " 463 \n",
- " \n",
- " \n",
- " 364044 \n",
- " 364044 \n",
- " 460 \n",
- " 1457974279000 \n",
- " 177 \n",
- " \n",
- " \n",
- " 364045 \n",
- " 364045 \n",
- " 460 \n",
- " 1515964737000 \n",
- " 126 \n",
- " \n",
- " \n",
- " 364046 \n",
- " 364046 \n",
- " 460 \n",
- " 1505811330000 \n",
- " 479 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n",
+ "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " click_article_id category_id created_at_ts words_count\n",
- "0 0 0 1513144419000 168\n",
- "1 1 1 1405341936000 189\n",
- "2 2 1 1408667706000 250\n",
- "3 3 1 1408468313000 230\n",
- "4 4 1 1407071171000 162\n",
- "364042 364042 460 1434034118000 144\n",
- "364043 364043 460 1434148472000 463\n",
- "364044 364044 460 1457974279000 177\n",
- "364045 364045 460 1515964737000 126\n",
- "364046 364046 460 1505811330000 479"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Browse the first and last rows of the news article dataset\n",
- "pd.concat([item_df.head(), item_df.tail()])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:28:13.084501Z",
- "start_time": "2020-11-13T15:28:13.062561Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "176 3485\n",
- "182 3480\n",
- "179 3463\n",
- "178 3458\n",
- "174 3456\n",
- "183 3432\n",
- "184 3427\n",
- "173 3414\n",
- "180 3403\n",
- "177 3391\n",
- "170 3387\n",
- "187 3355\n",
- "169 3352\n",
- "185 3348\n",
- "175 3346\n",
- "181 3330\n",
- "186 3328\n",
- "189 3327\n",
- "171 3327\n",
- "172 3322\n",
- "165 3308\n",
- "188 3288\n",
- "167 3269\n",
- "190 3261\n",
- "192 3257\n",
- "168 3248\n",
- "193 3225\n",
- "166 3199\n",
- "191 3182\n",
- "194 3164\n",
- " ... \n",
- "601 1\n",
- "857 1\n",
- "1977 1\n",
- "1626 1\n",
- "697 1\n",
- "1720 1\n",
- "696 1\n",
- "706 1\n",
- "592 1\n",
- "1605 1\n",
- "586 1\n",
- "582 1\n",
- "1606 1\n",
- "972 1\n",
- "716 1\n",
- "584 1\n",
- "1608 1\n",
- "715 1\n",
- "841 1\n",
- "968 1\n",
- "964 1\n",
- "587 1\n",
- "1099 1\n",
- "1355 1\n",
- "711 1\n",
- "845 1\n",
- "710 1\n",
- "965 1\n",
- "847 1\n",
- "1535 1\n",
- "Name: words_count, Length: 866, dtype: int64"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_df['words_count'].value_counts()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:28:59.029535Z",
- "start_time": "2020-11-13T15:28:58.816106Z"
- }
- },
- "outputs": [
+ "source": [
+ "plt.figure()\n",
+ "plt.figure(figsize=(15, 20))\n",
+ "i = 1\n",
+ "for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', \n",
+ " 'click_region', 'click_referrer_type', 'rank', 'click_cnts']:\n",
+ " plot_envs = plt.subplot(5, 2, i)\n",
+ " i += 1\n",
+ " v = trn_click[col].value_counts().reset_index()[:10]\n",
+ " fig = sns.barplot(x=v['index'], y=v[col])\n",
+ " for item in fig.get_xticklabels():\n",
+ " item.set_rotation(90)\n",
+ " plt.title(col)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "461\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
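+ "Note: the click_cnts histogram here shows the cumulative click counts of the users associated with each article.\n",
+ "\n",
+ "The same analysis can also be done from the user's perspective, by plotting a histogram of how many articles each user clicked."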
+ ]
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4 1084627\n",
+ "2 25894\n",
+ "1 2102\n",
+ "Name: click_environment, dtype: int64"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_click['click_environment'].value_counts()"
]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at click_environment, only 2,102 clicks (0.19%) come from environment 1 and only 25,894 clicks (2.3%) from environment 2; the remaining 97.5% come from environment 4."
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "print(item_df['category_id'].nunique()) # 461 article categories\n",
- "item_df['category_id'].hist()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(364047, 4)"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_df.shape # 364,047 articles"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### News article embedding vectors"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " article_id \n",
- " emb_0 \n",
- " emb_1 \n",
- " emb_2 \n",
- " emb_3 \n",
- " emb_4 \n",
- " emb_5 \n",
- " emb_6 \n",
- " emb_7 \n",
- " emb_8 \n",
- " ... \n",
- " emb_240 \n",
- " emb_241 \n",
- " emb_242 \n",
- " emb_243 \n",
- " emb_244 \n",
- " emb_245 \n",
- " emb_246 \n",
- " emb_247 \n",
- " emb_248 \n",
- " emb_249 \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " -0.161183 \n",
- " -0.957233 \n",
- " -0.137944 \n",
- " 0.050855 \n",
- " 0.830055 \n",
- " 0.901365 \n",
- " -0.335148 \n",
- " -0.559561 \n",
- " -0.500603 \n",
- " ... \n",
- " 0.321248 \n",
- " 0.313999 \n",
- " 0.636412 \n",
- " 0.169179 \n",
- " 0.540524 \n",
- " -0.813182 \n",
- " 0.286870 \n",
- " -0.231686 \n",
- " 0.597416 \n",
- " 0.409623 \n",
- " \n",
- " \n",
- " 1 \n",
- " 1 \n",
- " -0.523216 \n",
- " -0.974058 \n",
- " 0.738608 \n",
- " 0.155234 \n",
- " 0.626294 \n",
- " 0.485297 \n",
- " -0.715657 \n",
- " -0.897996 \n",
- " -0.359747 \n",
- " ... \n",
- " -0.487843 \n",
- " 0.823124 \n",
- " 0.412688 \n",
- " -0.338654 \n",
- " 0.320786 \n",
- " 0.588643 \n",
- " -0.594137 \n",
- " 0.182828 \n",
- " 0.397090 \n",
- " -0.834364 \n",
- " \n",
- " \n",
- " 2 \n",
- " 2 \n",
- " -0.619619 \n",
- " -0.972960 \n",
- " -0.207360 \n",
- " -0.128861 \n",
- " 0.044748 \n",
- " -0.387535 \n",
- " -0.730477 \n",
- " -0.066126 \n",
- " -0.754899 \n",
- " ... \n",
- " 0.454756 \n",
- " 0.473184 \n",
- " 0.377866 \n",
- " -0.863887 \n",
- " -0.383365 \n",
- " 0.137721 \n",
- " -0.810877 \n",
- " -0.447580 \n",
- " 0.805932 \n",
- " -0.285284 \n",
- " \n",
- " \n",
- " 3 \n",
- " 3 \n",
- " -0.740843 \n",
- " -0.975749 \n",
- " 0.391698 \n",
- " 0.641738 \n",
- " -0.268645 \n",
- " 0.191745 \n",
- " -0.825593 \n",
- " -0.710591 \n",
- " -0.040099 \n",
- " ... \n",
- " 0.271535 \n",
- " 0.036040 \n",
- " 0.480029 \n",
- " -0.763173 \n",
- " 0.022627 \n",
- " 0.565165 \n",
- " -0.910286 \n",
- " -0.537838 \n",
- " 0.243541 \n",
- " -0.885329 \n",
- " \n",
- " \n",
- " 4 \n",
- " 4 \n",
- " -0.279052 \n",
- " -0.972315 \n",
- " 0.685374 \n",
- " 0.113056 \n",
- " 0.238315 \n",
- " 0.271913 \n",
- " -0.568816 \n",
- " 0.341194 \n",
- " -0.600554 \n",
- " ... \n",
- " 0.238286 \n",
- " 0.809268 \n",
- " 0.427521 \n",
- " -0.615932 \n",
- " -0.503697 \n",
- " 0.614450 \n",
- " -0.917760 \n",
- " -0.424061 \n",
- " 0.185484 \n",
- " -0.580292 \n",
- " \n",
- " \n",
- "
\n",
- "
5 rows × 251 columns
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1 678187\n",
+ "3 395558\n",
+ "4 38731\n",
+ "5 141\n",
+ "2 6\n",
+ "Name: click_deviceGroup, dtype: int64"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " article_id emb_0 emb_1 emb_2 emb_3 emb_4 emb_5 \\\n",
- "0 0 -0.161183 -0.957233 -0.137944 0.050855 0.830055 0.901365 \n",
- "1 1 -0.523216 -0.974058 0.738608 0.155234 0.626294 0.485297 \n",
- "2 2 -0.619619 -0.972960 -0.207360 -0.128861 0.044748 -0.387535 \n",
- "3 3 -0.740843 -0.975749 0.391698 0.641738 -0.268645 0.191745 \n",
- "4 4 -0.279052 -0.972315 0.685374 0.113056 0.238315 0.271913 \n",
- "\n",
- " emb_6 emb_7 emb_8 ... emb_240 emb_241 emb_242 \\\n",
- "0 -0.335148 -0.559561 -0.500603 ... 0.321248 0.313999 0.636412 \n",
- "1 -0.715657 -0.897996 -0.359747 ... -0.487843 0.823124 0.412688 \n",
- "2 -0.730477 -0.066126 -0.754899 ... 0.454756 0.473184 0.377866 \n",
- "3 -0.825593 -0.710591 -0.040099 ... 0.271535 0.036040 0.480029 \n",
- "4 -0.568816 0.341194 -0.600554 ... 0.238286 0.809268 0.427521 \n",
- "\n",
- " emb_243 emb_244 emb_245 emb_246 emb_247 emb_248 emb_249 \n",
- "0 0.169179 0.540524 -0.813182 0.286870 -0.231686 0.597416 0.409623 \n",
- "1 -0.338654 0.320786 0.588643 -0.594137 0.182828 0.397090 -0.834364 \n",
- "2 -0.863887 -0.383365 0.137721 -0.810877 -0.447580 0.805932 -0.285284 \n",
- "3 -0.763173 0.022627 0.565165 -0.910286 -0.537838 0.243541 -0.885329 \n",
- "4 -0.615932 -0.503697 0.614450 -0.917760 -0.424061 0.185484 -0.580292 \n",
- "\n",
- "[5 rows x 251 columns]"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_emb_df.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(295141, 251)"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_emb_df.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Repeated clicks by users"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:30:20.899771Z",
- "start_time": "2020-11-13T15:30:20.750817Z"
- }
- },
- "outputs": [],
- "source": [
- "# Merge the training and test click logs\n",
- "user_click_merge = pd.concat([trn_click, tst_click])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:30:26.290038Z",
- "start_time": "2020-11-13T15:30:25.339579Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " 30760 \n",
- " 1 \n",
- " \n",
- " \n",
- " 1 \n",
- " 0 \n",
- " 157507 \n",
- " 1 \n",
- " \n",
- " \n",
- " 2 \n",
- " 1 \n",
- " 63746 \n",
- " 1 \n",
- " \n",
- " \n",
- " 3 \n",
- " 1 \n",
- " 289197 \n",
- " 1 \n",
- " \n",
- " \n",
- " 4 \n",
- " 2 \n",
- " 36162 \n",
- " 1 \n",
- " \n",
- " \n",
- " 5 \n",
- " 2 \n",
- " 168401 \n",
- " 1 \n",
- " \n",
- " \n",
- " 6 \n",
- " 3 \n",
- " 36162 \n",
- " 1 \n",
- " \n",
- " \n",
- " 7 \n",
- " 3 \n",
- " 50644 \n",
- " 1 \n",
- " \n",
- " \n",
- " 8 \n",
- " 4 \n",
- " 39894 \n",
- " 1 \n",
- " \n",
- " \n",
- " 9 \n",
- " 4 \n",
- " 42567 \n",
- " 1 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "trn_click['click_deviceGroup'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at click_deviceGroup, device group 1 accounts for the majority of clicks (61%) and device group 3 for 36%."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Test set user click logs"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " rank \n",
+ " click_cnts \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 249999 \n",
+ " 160974 \n",
+ " 1506959142820 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 13 \n",
+ " 2 \n",
+ " 19 \n",
+ " 19 \n",
+ " 281 \n",
+ " 1506912747000 \n",
+ " 259 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 249999 \n",
+ " 160417 \n",
+ " 1506959172820 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 13 \n",
+ " 2 \n",
+ " 18 \n",
+ " 19 \n",
+ " 281 \n",
+ " 1506942089000 \n",
+ " 173 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 249998 \n",
+ " 160974 \n",
+ " 1506959056066 \n",
+ " 4 \n",
+ " 1 \n",
+ " 12 \n",
+ " 1 \n",
+ " 13 \n",
+ " 2 \n",
+ " 5 \n",
+ " 5 \n",
+ " 281 \n",
+ " 1506912747000 \n",
+ " 259 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 249998 \n",
+ " 202557 \n",
+ " 1506959086066 \n",
+ " 4 \n",
+ " 1 \n",
+ " 12 \n",
+ " 1 \n",
+ " 13 \n",
+ " 2 \n",
+ " 4 \n",
+ " 5 \n",
+ " 327 \n",
+ " 1506938401000 \n",
+ " 219 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 249997 \n",
+ " 183665 \n",
+ " 1506959088613 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 15 \n",
+ " 5 \n",
+ " 7 \n",
+ " 7 \n",
+ " 301 \n",
+ " 1500895686000 \n",
+ " 256 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "0 249999 160974 1506959142820 4 \n",
+ "1 249999 160417 1506959172820 4 \n",
+ "2 249998 160974 1506959056066 4 \n",
+ "3 249998 202557 1506959086066 4 \n",
+ "4 249997 183665 1506959088613 4 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "0 1 17 1 13 \n",
+ "1 1 17 1 13 \n",
+ "2 1 12 1 13 \n",
+ "3 1 12 1 13 \n",
+ "4 1 17 1 15 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
+ "0 2 19 19 281 1506912747000 \n",
+ "1 2 18 19 281 1506942089000 \n",
+ "2 2 5 5 281 1506912747000 \n",
+ "3 2 4 5 327 1506938401000 \n",
+ "4 5 7 7 301 1500895686000 \n",
+ "\n",
+ " words_count \n",
+ "0 259 \n",
+ "1 173 \n",
+ "2 259 \n",
+ "3 219 \n",
+ "4 256 "
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id count\n",
- "0 0 30760 1\n",
- "1 0 157507 1\n",
- "2 1 63746 1\n",
- "3 1 289197 1\n",
- "4 2 36162 1\n",
- "5 2 168401 1\n",
- "6 3 36162 1\n",
- "7 3 50644 1\n",
- "8 4 39894 1\n",
- "9 4 42567 1"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Repeated clicks by users\n",
- "user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg({'count'}).reset_index()\n",
- "user_click_count[:10]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:34:27.418638Z",
- "start_time": "2020-11-13T15:34:27.372761Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 311242 \n",
- " 86295 \n",
- " 74254 \n",
- " 10 \n",
- " \n",
- " \n",
- " 311243 \n",
- " 86295 \n",
- " 76268 \n",
- " 10 \n",
- " \n",
- " \n",
- " 393761 \n",
- " 103237 \n",
- " 205948 \n",
- " 10 \n",
- " \n",
- " \n",
- " 393763 \n",
- " 103237 \n",
- " 235689 \n",
- " 10 \n",
- " \n",
- " \n",
- " 576902 \n",
- " 134850 \n",
- " 69463 \n",
- " 13 \n",
- " \n",
- " \n",
- "\n",
- ""
+ "source": [
+ "tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])\n",
+ "tst_click.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " rank \n",
+ " click_cnts \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 5.180100e+05 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 518010.000000 \n",
+ " 5.180100e+05 \n",
+ " 518010.000000 \n",
+ " \n",
+ " \n",
+ " mean \n",
+ " 227342.428169 \n",
+ " 193803.792550 \n",
+ " 1.507387e+12 \n",
+ " 3.947300 \n",
+ " 1.738285 \n",
+ " 13.628467 \n",
+ " 1.348209 \n",
+ " 18.250250 \n",
+ " 1.819614 \n",
+ " 15.521785 \n",
+ " 30.043586 \n",
+ " 305.324961 \n",
+ " 1.506883e+12 \n",
+ " 210.966331 \n",
+ " \n",
+ " \n",
+ " std \n",
+ " 14613.907188 \n",
+ " 88279.388177 \n",
+ " 3.706127e+08 \n",
+ " 0.323916 \n",
+ " 1.020858 \n",
+ " 6.625564 \n",
+ " 1.703524 \n",
+ " 7.060798 \n",
+ " 1.082657 \n",
+ " 33.957702 \n",
+ " 56.868021 \n",
+ " 110.411513 \n",
+ " 5.816668e+09 \n",
+ " 83.040065 \n",
+ " \n",
+ " \n",
+ " min \n",
+ " 200000.000000 \n",
+ " 137.000000 \n",
+ " 1.506959e+12 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 2.000000 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 1.000000 \n",
+ " 1.265812e+12 \n",
+ " 0.000000 \n",
+ " \n",
+ " \n",
+ " 25% \n",
+ " 214926.000000 \n",
+ " 128551.000000 \n",
+ " 1.507026e+12 \n",
+ " 4.000000 \n",
+ " 1.000000 \n",
+ " 12.000000 \n",
+ " 1.000000 \n",
+ " 13.000000 \n",
+ " 1.000000 \n",
+ " 4.000000 \n",
+ " 10.000000 \n",
+ " 252.000000 \n",
+ " 1.506970e+12 \n",
+ " 176.000000 \n",
+ " \n",
+ " \n",
+ " 50% \n",
+ " 229109.000000 \n",
+ " 199197.000000 \n",
+ " 1.507308e+12 \n",
+ " 4.000000 \n",
+ " 1.000000 \n",
+ " 17.000000 \n",
+ " 1.000000 \n",
+ " 21.000000 \n",
+ " 2.000000 \n",
+ " 8.000000 \n",
+ " 19.000000 \n",
+ " 323.000000 \n",
+ " 1.507249e+12 \n",
+ " 199.000000 \n",
+ " \n",
+ " \n",
+ " 75% \n",
+ " 240182.000000 \n",
+ " 272143.000000 \n",
+ " 1.507666e+12 \n",
+ " 4.000000 \n",
+ " 3.000000 \n",
+ " 17.000000 \n",
+ " 1.000000 \n",
+ " 25.000000 \n",
+ " 2.000000 \n",
+ " 18.000000 \n",
+ " 35.000000 \n",
+ " 399.000000 \n",
+ " 1.507630e+12 \n",
+ " 232.000000 \n",
+ " \n",
+ " \n",
+ " max \n",
+ " 249999.000000 \n",
+ " 364043.000000 \n",
+ " 1.508832e+12 \n",
+ " 4.000000 \n",
+ " 5.000000 \n",
+ " 20.000000 \n",
+ " 11.000000 \n",
+ " 28.000000 \n",
+ " 7.000000 \n",
+ " 938.000000 \n",
+ " 938.000000 \n",
+ " 460.000000 \n",
+ " 1.509949e+12 \n",
+ " 3082.000000 \n",
+ " \n",
+ " \n",
+ "\n",
+ ""
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "count 518010.000000 518010.000000 5.180100e+05 518010.000000 \n",
+ "mean 227342.428169 193803.792550 1.507387e+12 3.947300 \n",
+ "std 14613.907188 88279.388177 3.706127e+08 0.323916 \n",
+ "min 200000.000000 137.000000 1.506959e+12 1.000000 \n",
+ "25% 214926.000000 128551.000000 1.507026e+12 4.000000 \n",
+ "50% 229109.000000 199197.000000 1.507308e+12 4.000000 \n",
+ "75% 240182.000000 272143.000000 1.507666e+12 4.000000 \n",
+ "max 249999.000000 364043.000000 1.508832e+12 4.000000 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
+ "mean 1.738285 13.628467 1.348209 18.250250 \n",
+ "std 1.020858 6.625564 1.703524 7.060798 \n",
+ "min 1.000000 2.000000 1.000000 1.000000 \n",
+ "25% 1.000000 12.000000 1.000000 13.000000 \n",
+ "50% 1.000000 17.000000 1.000000 21.000000 \n",
+ "75% 3.000000 17.000000 1.000000 25.000000 \n",
+ "max 5.000000 20.000000 11.000000 28.000000 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id \\\n",
+ "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
+ "mean 1.819614 15.521785 30.043586 305.324961 \n",
+ "std 1.082657 33.957702 56.868021 110.411513 \n",
+ "min 1.000000 1.000000 1.000000 1.000000 \n",
+ "25% 1.000000 4.000000 10.000000 252.000000 \n",
+ "50% 2.000000 8.000000 19.000000 323.000000 \n",
+ "75% 2.000000 18.000000 35.000000 399.000000 \n",
+ "max 7.000000 938.000000 938.000000 460.000000 \n",
+ "\n",
+ " created_at_ts words_count \n",
+ "count 5.180100e+05 518010.000000 \n",
+ "mean 1.506883e+12 210.966331 \n",
+ "std 5.816668e+09 83.040065 \n",
+ "min 1.265812e+12 0.000000 \n",
+ "25% 1.506970e+12 176.000000 \n",
+ "50% 1.507249e+12 199.000000 \n",
+ "75% 1.507630e+12 232.000000 \n",
+ "max 1.509949e+12 3082.000000 "
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id count\n",
- "311242 86295 74254 10\n",
- "311243 86295 76268 10\n",
- "393761 103237 205948 10\n",
- "393763 103237 235689 10\n",
- "576902 134850 69463 13"
- ]
- },
- "execution_count": 28,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_count[user_click_count['count']>7]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:32:53.298575Z",
- "start_time": "2020-11-13T15:32:53.285611Z"
- }
- },
- "outputs": [
+ "source": [
+ "tst_click.describe()"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "array([ 1, 2, 4, 3, 6, 5, 10, 7, 13])"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
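+ "We can see that the users in the training set and the test set are completely disjoint.\n",
+ "\n",
+ "User IDs in the training set range from 0 to 199999, while user IDs in test set A range from 200000 to 249999.\n",
+ "\n",
+ "Therefore, when training, we also need to include the test-set data; the combined data is referred to as the full dataset.\n",
+ "\n",
+ "**Note: the analysis that follows is performed on the merged training and test sets.**"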
]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_count['count'].unique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "1 1605541\n",
- "2 11621\n",
- "3 422\n",
- "4 77\n",
- "5 26\n",
- "6 12\n",
- "10 4\n",
- "7 3\n",
- "13 1\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Number of times each user clicked the same news article\n",
- "user_click_count.loc[:,'count'].value_counts() "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
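- "###### Observation: 1,605,541 user-article pairs (about 99.2%) involve only a single click, so only a very small fraction of users clicked the same article more than once. Repeat clicks could also be turned into a separate feature."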
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Analysis of changes in users' click environment"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:39:41.961797Z",
- "start_time": "2020-11-13T15:39:41.949829Z"
- }
- },
- "outputs": [],
- "source": [
- "def plot_envs(df, cols, r, c):\n",
- " plt.figure()\n",
- " plt.figure(figsize=(10, 5))\n",
- " i = 1\n",
- " for col in cols:\n",
- " plt.subplot(r, c, i)\n",
- " i += 1\n",
- " v = df[col].value_counts().reset_index()\n",
- " fig = sns.barplot(x=v['index'], y=v[col])\n",
- " for item in fig.get_xticklabels():\n",
- " item.set_rotation(90)\n",
- " plt.title(col)\n",
- " plt.tight_layout()\n",
- " plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:39:55.476626Z",
- "start_time": "2020-11-13T15:39:48.764592Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "50000"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# The test set contains 50,000 users\n",
+ "tst_click.user_id.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:56:07.717463Z",
+ "start_time": "2020-11-13T15:56:07.693494Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tst_click.groupby('user_id')['click_article_id'].count().min() # Note: some users in the test set have only a single click record"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News article information table"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:20:34.183761Z",
+ "start_time": "2020-11-13T15:20:34.164770Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "\n",
+ " \n",
+ " \n",
+ " \n",
+ " click_article_id \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 0 \n",
+ " 0 \n",
+ " 1513144419000 \n",
+ " 168 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 1 \n",
+ " 1 \n",
+ " 1405341936000 \n",
+ " 189 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 2 \n",
+ " 1 \n",
+ " 1408667706000 \n",
+ " 250 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 3 \n",
+ " 1 \n",
+ " 1408468313000 \n",
+ " 230 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 4 \n",
+ " 1 \n",
+ " 1407071171000 \n",
+ " 162 \n",
+ " \n",
+ " \n",
+ " 364042 \n",
+ " 364042 \n",
+ " 460 \n",
+ " 1434034118000 \n",
+ " 144 \n",
+ " \n",
+ " \n",
+ " 364043 \n",
+ " 364043 \n",
+ " 460 \n",
+ " 1434148472000 \n",
+ " 463 \n",
+ " \n",
+ " \n",
+ " 364044 \n",
+ " 364044 \n",
+ " 460 \n",
+ " 1457974279000 \n",
+ " 177 \n",
+ " \n",
+ " \n",
+ " 364045 \n",
+ " 364045 \n",
+ " 460 \n",
+ " 1515964737000 \n",
+ " 126 \n",
+ " \n",
+ " \n",
+ " 364046 \n",
+ " 364046 \n",
+ " 460 \n",
+ " 1505811330000 \n",
+ " 479 \n",
+ " \n",
+ " \n",
+ "\n",
+ ""
+ ],
+ "text/plain": [
+ " click_article_id category_id created_at_ts words_count\n",
+ "0 0 0 1513144419000 168\n",
+ "1 1 1 1405341936000 189\n",
+ "2 2 1 1408667706000 250\n",
+ "3 3 1 1408468313000 230\n",
+ "4 4 1 1407071171000 162\n",
+ "364042 364042 460 1434034118000 144\n",
+ "364043 364043 460 1434148472000 463\n",
+ "364044 364044 460 1457974279000 177\n",
+ "364045 364045 460 1515964737000 126\n",
+ "364046 364046 460 1505811330000 479"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Browse the news article dataset (first and last five rows)\n",
+ "pd.concat([item_df.head(), item_df.tail()])"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:28:13.084501Z",
+ "start_time": "2020-11-13T15:28:13.062561Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "176 3485\n",
+ "182 3480\n",
+ "179 3463\n",
+ "178 3458\n",
+ "174 3456\n",
+ "183 3432\n",
+ "184 3427\n",
+ "173 3414\n",
+ "180 3403\n",
+ "177 3391\n",
+ "170 3387\n",
+ "187 3355\n",
+ "169 3352\n",
+ "185 3348\n",
+ "175 3346\n",
+ "181 3330\n",
+ "186 3328\n",
+ "189 3327\n",
+ "171 3327\n",
+ "172 3322\n",
+ "165 3308\n",
+ "188 3288\n",
+ "167 3269\n",
+ "190 3261\n",
+ "192 3257\n",
+ "168 3248\n",
+ "193 3225\n",
+ "166 3199\n",
+ "191 3182\n",
+ "194 3164\n",
+ " ... \n",
+ "601 1\n",
+ "857 1\n",
+ "1977 1\n",
+ "1626 1\n",
+ "697 1\n",
+ "1720 1\n",
+ "696 1\n",
+ "706 1\n",
+ "592 1\n",
+ "1605 1\n",
+ "586 1\n",
+ "582 1\n",
+ "1606 1\n",
+ "972 1\n",
+ "716 1\n",
+ "584 1\n",
+ "1608 1\n",
+ "715 1\n",
+ "841 1\n",
+ "968 1\n",
+ "964 1\n",
+ "587 1\n",
+ "1099 1\n",
+ "1355 1\n",
+ "711 1\n",
+ "845 1\n",
+ "710 1\n",
+ "965 1\n",
+ "847 1\n",
+ "1535 1\n",
+ "Name: words_count, Length: 866, dtype: int64"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_df['words_count'].value_counts()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:28:59.029535Z",
+ "start_time": "2020-11-13T15:28:58.816106Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "461\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "print(item_df['category_id'].nunique()) # 461 article categories\n",
+ "item_df['category_id'].hist()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(364047, 4)"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_df.shape # 364047 articles"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News article embedding vectors"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "\n",
+ " \n",
+ " \n",
+ " \n",
+ " article_id \n",
+ " emb_0 \n",
+ " emb_1 \n",
+ " emb_2 \n",
+ " emb_3 \n",
+ " emb_4 \n",
+ " emb_5 \n",
+ " emb_6 \n",
+ " emb_7 \n",
+ " emb_8 \n",
+ " ... \n",
+ " emb_240 \n",
+ " emb_241 \n",
+ " emb_242 \n",
+ " emb_243 \n",
+ " emb_244 \n",
+ " emb_245 \n",
+ " emb_246 \n",
+ " emb_247 \n",
+ " emb_248 \n",
+ " emb_249 \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 0 \n",
+ " -0.161183 \n",
+ " -0.957233 \n",
+ " -0.137944 \n",
+ " 0.050855 \n",
+ " 0.830055 \n",
+ " 0.901365 \n",
+ " -0.335148 \n",
+ " -0.559561 \n",
+ " -0.500603 \n",
+ " ... \n",
+ " 0.321248 \n",
+ " 0.313999 \n",
+ " 0.636412 \n",
+ " 0.169179 \n",
+ " 0.540524 \n",
+ " -0.813182 \n",
+ " 0.286870 \n",
+ " -0.231686 \n",
+ " 0.597416 \n",
+ " 0.409623 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 1 \n",
+ " -0.523216 \n",
+ " -0.974058 \n",
+ " 0.738608 \n",
+ " 0.155234 \n",
+ " 0.626294 \n",
+ " 0.485297 \n",
+ " -0.715657 \n",
+ " -0.897996 \n",
+ " -0.359747 \n",
+ " ... \n",
+ " -0.487843 \n",
+ " 0.823124 \n",
+ " 0.412688 \n",
+ " -0.338654 \n",
+ " 0.320786 \n",
+ " 0.588643 \n",
+ " -0.594137 \n",
+ " 0.182828 \n",
+ " 0.397090 \n",
+ " -0.834364 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 2 \n",
+ " -0.619619 \n",
+ " -0.972960 \n",
+ " -0.207360 \n",
+ " -0.128861 \n",
+ " 0.044748 \n",
+ " -0.387535 \n",
+ " -0.730477 \n",
+ " -0.066126 \n",
+ " -0.754899 \n",
+ " ... \n",
+ " 0.454756 \n",
+ " 0.473184 \n",
+ " 0.377866 \n",
+ " -0.863887 \n",
+ " -0.383365 \n",
+ " 0.137721 \n",
+ " -0.810877 \n",
+ " -0.447580 \n",
+ " 0.805932 \n",
+ " -0.285284 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 3 \n",
+ " -0.740843 \n",
+ " -0.975749 \n",
+ " 0.391698 \n",
+ " 0.641738 \n",
+ " -0.268645 \n",
+ " 0.191745 \n",
+ " -0.825593 \n",
+ " -0.710591 \n",
+ " -0.040099 \n",
+ " ... \n",
+ " 0.271535 \n",
+ " 0.036040 \n",
+ " 0.480029 \n",
+ " -0.763173 \n",
+ " 0.022627 \n",
+ " 0.565165 \n",
+ " -0.910286 \n",
+ " -0.537838 \n",
+ " 0.243541 \n",
+ " -0.885329 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 4 \n",
+ " -0.279052 \n",
+ " -0.972315 \n",
+ " 0.685374 \n",
+ " 0.113056 \n",
+ " 0.238315 \n",
+ " 0.271913 \n",
+ " -0.568816 \n",
+ " 0.341194 \n",
+ " -0.600554 \n",
+ " ... \n",
+ " 0.238286 \n",
+ " 0.809268 \n",
+ " 0.427521 \n",
+ " -0.615932 \n",
+ " -0.503697 \n",
+ " 0.614450 \n",
+ " -0.917760 \n",
+ " -0.424061 \n",
+ " 0.185484 \n",
+ " -0.580292 \n",
+ " \n",
+ " \n",
+ "\n",
+ "5 rows × 251 columns\n",
+ ""
+ ],
+ "text/plain": [
+ " article_id emb_0 emb_1 emb_2 emb_3 emb_4 emb_5 \\\n",
+ "0 0 -0.161183 -0.957233 -0.137944 0.050855 0.830055 0.901365 \n",
+ "1 1 -0.523216 -0.974058 0.738608 0.155234 0.626294 0.485297 \n",
+ "2 2 -0.619619 -0.972960 -0.207360 -0.128861 0.044748 -0.387535 \n",
+ "3 3 -0.740843 -0.975749 0.391698 0.641738 -0.268645 0.191745 \n",
+ "4 4 -0.279052 -0.972315 0.685374 0.113056 0.238315 0.271913 \n",
+ "\n",
+ " emb_6 emb_7 emb_8 ... emb_240 emb_241 emb_242 \\\n",
+ "0 -0.335148 -0.559561 -0.500603 ... 0.321248 0.313999 0.636412 \n",
+ "1 -0.715657 -0.897996 -0.359747 ... -0.487843 0.823124 0.412688 \n",
+ "2 -0.730477 -0.066126 -0.754899 ... 0.454756 0.473184 0.377866 \n",
+ "3 -0.825593 -0.710591 -0.040099 ... 0.271535 0.036040 0.480029 \n",
+ "4 -0.568816 0.341194 -0.600554 ... 0.238286 0.809268 0.427521 \n",
+ "\n",
+ " emb_243 emb_244 emb_245 emb_246 emb_247 emb_248 emb_249 \n",
+ "0 0.169179 0.540524 -0.813182 0.286870 -0.231686 0.597416 0.409623 \n",
+ "1 -0.338654 0.320786 0.588643 -0.594137 0.182828 0.397090 -0.834364 \n",
+ "2 -0.863887 -0.383365 0.137721 -0.810877 -0.447580 0.805932 -0.285284 \n",
+ "3 -0.763173 0.022627 0.565165 -0.910286 -0.537838 0.243541 -0.885329 \n",
+ "4 -0.615932 -0.503697 0.614450 -0.917760 -0.424061 0.185484 -0.580292 \n",
+ "\n",
+ "[5 rows x 251 columns]"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_emb_df.head()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(295141, 251)"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_emb_df.shape"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data analysis"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Repeated clicks by users"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAFgCAYAAACmDI9oAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAA9VUlEQVR4nO3dd5wlZZn28d/FkGFAgTEQB1FgQVFkFAVeZREVEAwoAmIAXVlXJSxiQH0FRVF3BcW4ApIEJYmvKAYQHBVQYAgSF5U4BGHIOQ3X+0c9jTVNh3O6u7rOmb6+n099+lR4qu7umbufu6ueqpJtIiIiIiKiskjbAURERERE9JIUyBERERERNSmQIyIiIiJqUiBHRERERNSkQI6IiIiIqEmBHBERERFRkwI5IgKQtKukc2rzD0p6wShtZkqypEXHeewbJG05nn2U/Ywac0REjC4FckPa7GwnSjrbmMpsL2v7urbj6MZExSzp9ZJ+J+kBSXdJulTSJyUtORFxRvSzhaF/j9GlQJ4kU7mzbZOkzSXd3HYcEf1C0g7AKcCPgDVsrwjsCKwKrDZMm3T6MWUtDH1lPFMK5BgTSdPajiFirCStJulUSfPKGdJvD7GNJb2wfF5K0sGSbpR0n6RzJC01RJu3l+ESLx7l+O8p+7pL0mcGrVtE0qckXVvWnyRphbLuV5I+Omj7v0javpuYJb1K0nmS7i3tNy/LBRwCfMH24bbvBrB9je09bP+tbHeApFMkHSfpfmBXSStLOk3S3ZL+LumDtRiPlvTF2vwCf7iWn9l+kq6SdI+ko3K2OiLalAJ5AvRAZztkZ1fWzZZ0oKRzy+XSMyStVNZ109keLel7kn4p6SHgXyX9S9n/vZKulPTm2n6OlvQdSaeX454vaa1BP48PS/pbWX+gpLXK93F/KQoWr22/rarLvPeWbTaorbtB0r6SLis/zxMlLSlpGeBXwMqqLoE9KGnlkX6WsfBT9cfdL4AbgZnAKsAJozT7GrARsAmwAvAJ4KlB+90N+Cqwpe0rRjj+esD3gPcAKwMrUp2dHbAH8FbgtWX9PcB3yrofAzsP2tcawOmdxixplbL9F8vyfYGfSJoBrFNi+ckIP4sBb6E60/ws4Hiqn+HNJeZ3AAdJ2qKD/QzYBXgjsBawNvDZLtpGNKIH+vc3l/713tLf/ktt3Scl3VL60GskvW4ivucobGcaxwRMA/4CfB1YBlgS2AzYFTintp2BF5bP3wFmU3XM06g6sCWoOmsDiwK7AX8faDPC8VcB7gK2ofqD5/VlfkZZPxu4lqrDWarMf6Wsey9wbm1f6wH3AksMEfPRwH3ApuU400t8nwYWB7YAHgDWqW1/F/DK8v0cD5ww6OfxM2A5YH3gMeAs4AXA8sBVwPvKthsCdwAbl5/X+4AbanHeAFxA1TGvAFwNfKis2xy4ue3/J5l6ZwJeDcwDFh20fMicLf/fHwFeOsS+BnJ23/J/dtUOjv+5QbmwDPA4VWFN+f/7utr65wNPlDyaDjxENfQB4EvAkV3G/Engh4OW/abk1WZlH0vW1p1Qfi88DLynLDsA+ENtm9WA+cD02rIvA0eXz0cDX6ytWyAvSw5/qDa/DXBt2/9XMk3tifb797VLvr8eWIzqj9y/U/W56wBzgZXLtjOBtdr+mS1MU84gj98rqQqzj9t+yPajts8ZbmNJiwDvB/ayfYvt+bbPs/1YbbO9gY8Dm9v++yjHfzfwS9u/tP2U7TOBOVQdzICjbP/V9iPAScDLyvKfAi+TtEaZ3wU4dVAsdT+zfa7tp8o+lqUqth+3fTbVWbmda9v/1PYFtp+kKpBfNmh//2X7fttXAlcAZ9i+zvZ9VGd+Nyzb7Q583/b55ed1DFVB/aravr5p+1ZXl4R/PsSxIgasBtxY/l92YiWqjvHaEbb5OPAd252Md1+ZqmMDwPZDVH9MDlgD+Gk5Y3QvVcE8H3iu7Qeozv7uVLbdmSq3uol5DWCH
gf2XY2xGVYgPxPH8Wnw72X4WcDFVhz9gbu3zysDdJb4BN1IVCZ2q7+/Gss+INrXdv+8InG77TNtPUF0VWoqq6J5PVXivJ2kx2zfYHul3VHQpBfL4td3ZjtTZDfhH7fPDVIUtXXS2AwZ3iHNLsTxgcIc45HFrbq99fmSI+YHt1wA+Nuh7XI0FO9DRjhUxYC6wujq/sexO4FGqS//DeQPwWUlv72B/t1G72U3S0lTDLOrxbW37WbVpSdu3lPU/BnaW9Gqq3yW/6zLmuVRnkOv7X8b2V4BrgFuA7Tv4Plz7fCuwgqTptWWrl31BdRZs6dq65w2xv/oNgKuXfUa0qe3+fWWqfhWA0t/OBVYpxfXeVFdz7pB0QoYQTqwUyOPXdmc7UmfXiU462wGDO8TVyl/MA+od4kSaC3xp0Pe4tO0fd9DWo28SU8wFVEXqVyQtU8arbzrcxqVTOhI4RNWNaNMkvVrSErXNrgS2Ar5TH4s/jFOAbSVtVsbZf4EFfxf/D/ClgSs7kmZIektt/S+p/mj8AnDioD9SO4n5OGA7SW8sy5dUddPcqqXdx4D9JX1Q0rNVeRHw3BF+RnOB84Avl/1tAHygHAvgUmAbSStIeh5Vxz7YRyStquqGxM8AJ470Q4yYBG3377dS5Trw9E20q1H6Wds/sr1Z2cZU90DEBEmBPH5td7bDdnYdxj9qZzuM86nO1H5C0mKqbgzcjtFvdhqLw4EPSdq4dNbLSHrToLNVw7kdWFHS8g3EFX3I9nyq/6svBG6iurFsx1Ga7QtcDlwI3E3VES3w+9P2X4BtgcMlbT3C8a8EPkL1GLXbqG7Cq59NOhQ4DThD0gPAn6nG3w+0fww4Fdiy7KOrmEsx+xaq+wfmURUBHx/4fmyfCLyTavjWXKpO/yTgMODkEY63M9U4yFuphm/tb/u3Zd0PqcZy3gCcwdDF74/KuuuozsB9cYhtIiZT2/37ScCbJL1O0mJUf7w+BpwnaR1JW5R9P0p11bXT/js60dbg54Vpojpz+v+oxu/dCXyTkQfxLwV8g+qvwPuAP5RlM8t2i5btZlEVeFuPcvyNgd9TdYLzqIZNrF7WzQb+rbbtAnGVZT8ox33FoOWDb9L74qD165fj3kd1g9LbausW2J5n3pTz9L7L/DnArrX5LwJH1Oa3ouro76X6hXUy5YYgqk53y9q2BwDH1eaPLP8291JuaMiUKVPvTINzOFOmXpl6oH9/W+lf7yv97fpl+QZUBfwDpe//Rfq3iZ1UftARERGtkHQD1R/yvx1t24iIyZAhFhERE0zSLvrns7fr05VtxxYREaPLGeQ+IGkX4PtDrLrR9vqTHU9ERESMX/r33pUCOSIiIiKiptNHl0yKlVZayTNnzmw7jIiectFFF91pe0bbcQwlORuxoORrRH8ZLmd7qkCeOXMmc+bMaTuMiJ4i6cbRt2pHcjZiQcnXiP4yXM7mJr2IiIiIiJoUyBERERERNSmQIyIiIiJqUiBHRERERNT01E16E22jjx/bdggxBV303+9tO4S+lZyNyZZ8Hbvka7RhsnI2Z5AjIiIiImpSIEdERERE1KRAjoiIiIioSYEcEREREVGTAjkiIiIioiYFckRERERETQrkiIiIiIiaFMgRERERETUdF8iSftjJsiG2+U9JV0q6QtKPJS3ZbZARMTaSlpM0vYvtk68RfSQ5G9GMbs4gr1+fkTQN2GikBpJWAfYEZtl+MTAN2KnbICOiO5JeIely4DLgCkl/kZR8jViIJGcjmjNqgSxpP0kPABtIur9MDwB3AD/r4BiLAktJWhRYGrh1XBFHRCd+AHzY9kzbawAfAY7qoF3yNaK/JGcjGjBqgWz7y7anA/9te7kyTbe9ou39Rml7C/A14CbgNuA+22fUt5G0u6Q5kubMmzdvHN9KRNTMt/3HgRnb5wBPjtSgk3yF5GxEr0gfG9GcjodY2N5P0iqSNpH0moFppDaSng28BVgTWBlYRtK7B+33MNuzbM+a
MWPGWL6HiHim30v6vqTNJb1W0neB2ZJeLunlQzXoJF8hORvRK9LHRjRn0U43lPQVqrFNVwHzy2IDfxih2ZbA9bbnlX2cCmwCHDemaCOiUy8tX/cftHxDqrzdYog2ydeI/pKcjWhIxwUy8DZgHduPddHmJuBVkpYGHgFeB8zpon1EjIHtfx1Ds+RrRH9JzkY0pJsC+TpgMaDjAtn2+ZJOAS6mGv94CXBYVxFGRNckfW6o5ba/MFyb5GtEf0nORjSnmwL5YeBSSWdRK5Jt7zlSI9v788zLvBHRrIdqn5cEtgWuHq1R8jWivyRnI5rRTYF8WpkiosfZPrg+L+lrwG9aCiciIqKvdFwg2z5G0lLA6ravaTCmiJh4SwOrth1EREREP+jmVdPbAZcCvy7zL5OUM8oRPUjS5ZIuK9OVwDXAN1oOKyIioi90M8TiAOCVwGwA25dKekEDMUXE+G1b+/wkcLvtEV8UEhEREZWOzyADT9i+b9CypyYymIiYGLZvBJ4FbEf1iMb1Wg0oIiKij3RTIF8p6V3ANEkvkvQt4LyG4oqIcZC0F3A88JwyHS9pj3ajioiI6A/dFMh7AOtTPeLtx8D9wN4NxBQR4/cBYGPbn7P9OeBVwAdbjikiIqIvdPMUi4eBz5QpInqb+Ocr4Smf1VIsERERfaXjAlnSLODTwMx6O9sbTHxYETFORwHnS/ppmX8r8IP2womIiOgf3TzF4njg48Dl5Oa8iJ4laRHgz1RPnNmsLN7N9iWtBRUREdFHuimQ59nOc48jepztpyR9x/aGwMVtxxMREdFvuimQ95d0BHAW1Y16ANg+dcKjiojxOkvS24FTbbvtYCIiIvpJNwXybsC6wGL8c4iFgRTIEb3n34F9gCclPUp1g55tL9duWBEREb2vmwL5FbbXaSySiJgwtqe3HUNERES/6qZAPk/SeravaiyaiBgXSdOApWw/WOZfBSxeVl9i+4HWgouIiOgT3bwo5FXApZKukXSZpMslXTZaI0nPknSKpP+VdLWkV4893IgYxVeBD9fmf0z19Jn/C3x2tMbJ14j+kpyNaEY3Z5C3GuMxDgV+bfsdkhYHlh7jfiJidK8DXlGbv9f2dpIE/LGD9snXiP6SnI1oQDdv0ruxXL59bqftJC0PvAbYtezjceDx7sOMiA4tYvvJ2vwnobo7T9KyIzVMvka0T9KzgdVsd3KFNjkb0ZCOh1hI2gO4HTgTOL1Mvxil2ZrAPOAoSZdIOkLSMoP2u7ukOZLmzJs3r7voI2KwxSU9fYOe7TPg6Y50yVHajpqvZV/J2YgJJGm2pOUkrUD17PLDJR3SQdP0sREN6WYM8l7AOrbXt/2SMo32mulFgZcD3ysvLXgI+FR9A9uH2Z5le9aMGTO6Cj4inuFw4ERJqw8skLQG1VjkI0ZpO2q+QnI2ogHL274f2B441vbGwJYdtEsfG9GQbgrkucB9Xe7/ZuBm2+eX+VOokjkiGmD7EOA04BxJd0m6G/gD8HPbXxulefI1oh2LSno+8E5GvzJbl5yNaEg3N+ldB8yWdDoLvklv2MtAtv8haa6kdWxfQ3UDUR4TF9Eg2/8D/M/AUItOH+2WfI1ozReA3wDn2r5Q0guAv43WKDkb0ZxuCuSbyrQ4/3yuaif2AI4vd9deR/VGvohokKTnAgcBKwNbS1oPeLXtH4zSNPkaMclsnwycXJu/Dnh7h82TsxEN6OYpFp8HGLgTfuBFBB20uxSYNZbgImLMjgaOAj5T5v8KnAiMWCAnXyMmn6RVgW8Bm5ZFfwT2sn3zaG2TsxHN6OYpFi+WdAlwJXClpIskrd9caBExDivZPgl4CqA8+m1+uyFFxDCOorp3YOUy/bwsi4iWdHOT3mHAPrbXsL0G8DGqO+Yjovc8JGlFwPD0K6e7vck2IibHDNtH2X6yTEcDeeRERIu6GYO8jO3fDczYnj3UM1IjoifsQ3VGai1J51J1tu9oN6SIGMZdkt5N9ThGgJ2Bu1qMJ2LK
6+opFpL+L/DDMv9uqhsCIqLH2L5Y0muBdQAB19h+ouWwImJo76cag/x1qqs+55Gb7SJa1c0Qi/dTnYU6FfgJsFJZFhE9RtJHgGVtX2n7CmBZSR9uO66IeCbbN9p+s+0Ztp9j+622bxpYL2m/NuOLmIo6KpAlTQNOtb2n7Zfb3sj23rbvaTi+iBibD9q+d2Cm5OoH2wsnIsZhh7YDiJhqOiqQbc8HnpK0fMPxRMTEmCZJAzPlj9xunl8eEb1Do28SEROpmzHIDwKXSzqT6n3vANjec8Kjiojx+jVwoqTvl/l/L8siov+47QAipppuCuRTyxQRve+TVEXxf5T5M4Ej2gsnIsYhZ5AjJlk3b9I7pslAImLi2H4K+F6ZIqKHSVrB9t2Dlq1p+/oye/IQzSKiQaMWyJJOsv1OSZczxGUe2xs0EllEdC35GtGXfi5pa9v3A0haDzgJeDGA7YPaDC5iKurkDPJe5eu2TQYSERMi+RrRfw6iKpLfRPXs8mOBXdoNKWJqG7VAtn1b+fh24ATbtzYbUkSMVfI1ov/YPl3SYsAZwHTgbbb/2nJYEVNaNzfpTQfOlHQ3cCJwsu3bmwkrIsYp+RrR4yR9iwWHQi0PXAt8VFKeEhXRom5u0vs88HlJGwA7Ar+XdLPtLRuLLiLGJPka0RfmDJq/qJUoIuIZujmDPOAO4B/AXcBzOmlQXlIwB7jFdsZGRkye5GtEjxp4OpSkZYBHy0u5BnJwiU72kXyNaEZHb9IDkPRhSbOBs4AVqV5l2+kd8XsBV3cfXkSMRfI1oq+cBSxVm18K+G2HbZOvEQ3ouEAGVgP2tr2+7QNsX9VJI0mrAm8iLymImEzJ14j+saTtBwdmyuelR2uUfI1oTscFsu39qF41vbKk1QemDpp+A/gE8NRQKyXtLmmOpDnz5s3rNJyIGEHJ12Ul7QYgaYakNTto+g1GyNeyr+RsxMR6SNLLB2YkbQQ80kG7b5B8jWhEN0MsPgrcTvXK2tPL9ItR2mwL3GF72BsPbB9me5btWTNmzOg0nIgYgaT9qV43vV9ZtBhw3ChtRs1XSM5GNGBv4GRJf5R0DtWTZz46UoPka0SzurlJb29gHdt3ddFmU+DNkrYBlgSWk3Sc7Xd3sY+I6N7bgA2BiwFs3ypp+ihtkq8RLbB9oaR1qV4SAnCN7SdGaZZ8jWhQN2OQ5wL3dbNz2/vZXtX2TGAn4Owkb8SkeNy2Kc9YLXfJjyj5GjG5JG1Rvm4PbAesXabtyrJhJV8jmtXNGeTrgNmSTgceG1ho+5AJjyoixuskSd8HniXpg8D7gcNbjikiFvRa4Gyq4ngwA6dObjgRMaCbAvmmMi1epq7Yng3M7rZdRHTP9tckvR64n+qy7edsn9lF+9kkXyMaZXv/8nW3ce5nNsnXiAnV7Zv0kLS07YebCykiJkIpiDsuiiNicknaZ6T1uUIb0Z6OC2RJrwZ+ACwLrC7ppcC/2/5wU8FFRHckPUAZdzwU28tNYjgRMbKRbpwdNo8jonndDLH4BvBG4DQA23+R9JomgoqIsbE9HUDSgcBtwA8BAbsAz28xtIgYpHZl9hhgL9v3lvlnAwe3GFrElNfNUyywPXfQovkTGEtETJw32/6u7Qds32/7e8Bb2g4qIoa0wUBxDGD7HqrHNEZES7p6zJukTQBLWkzSvuT97xG96iFJu0iaJmkRSbsAD7UdVEQMaZFy1hgASSvQ3RXeiJhg3STgh4BDgVWAW4AzgI80EVREjNu7qPL1UKqxjOeWZRHRew4G/iTp5DK/A/ClFuOJmPK6eYrFnVTjGIckaT/bX56QqCJiXGzfwAhDKpKvEb3D9rGS5gBblEXb276qzZgiprquxiCPYocJ3FdENCv5GtFDbF9l+9tlSnEc0bKJLJA1gfuKiGYlXyMiIoYxkQVyntkY0T+SrxEREcPIGeSIqSn5GhERMYyOC+Ty2JnBy9aszZ48eH1EtCP5GhERMXbdnEH+uaSnX1MraT3g
5wPztg+ayMAiYlySrxEREWPUTYF8EFWnu6ykjajOQL27mbAiYpySrxEREWPUzXOQT5e0GNULQqYDb7P918Yii4gxS75GRESM3agFsqRvseAd78sD1wIflYTtPUdouxpwLPDcso/DbB86vpAjYjjJ14ipIzkb0ZxOziDPGTR/URf7fxL4mO2LJU0HLpJ0Zh6CHtGY5GvE1JGcjWjIqAWy7WMAJC0DPGp7fpmfBiwxStvbgNvK5wckXQ2sAiR5IxqQfI2YOpKzEc3p5ia9s4ClavNLAb/ttLGkmcCGwPmDlu8uaY6kOfPmzesinIgYQSP5WtYlZyN6TPrYiInVTYG8pO0HB2bK56U7aShpWeAnwN6276+vs32Y7Vm2Z82YMaOLcCJiBI3ka9lXcjaih6SPjZh43RTID0l6+cBMeXTUI6M1KnfS/wQ43vap3YcYEWOQfI2YApKzEc3o+DFvwN7AyZJupXpN7fOAHUdqIEnAD4CrbR8y1iAjomt7k3yNWKglZyOa081zkC+UtC6wTll0je0nRmm2KfAe4HJJl5Zln7b9y64jjYiOJV8jpoTkbERDOnkO8ha2z5a0/aBVa5fnqg57Scf2OVRnryJiEiRfI6aO5GxEczo5g/xa4GxguyHWGciYp4jekXyNiIgYp06eg7x/+bpb8+FExHgkXyMiIsavkyEW+4y0PjcGRPSO5GtERMT4dTLEYvoI6zxRgUTEhEi+RkREjFMnQyw+DyDpGGAv2/eW+WcDBzcaXUR0JfkaERExft28KGSDgc4WwPY9VK+1jIjek3yNiIgYo24K5EXKWSgAJK1Ady8aiYjJk3yNiIgYo246zIOBP0k6uczvAHxp4kOKiAmQfI2IiBijbt6kd6ykOcAWZdH2tq9qJqyIGI/ka0RExNh1dcm1dLDpZCP6QPI1IiJibLoZgxwRERERsdBLgRwRERERUZMCOSIiIiKiJgVyRERERERNCuSIiIiIiJrGC2RJW0m6RtLfJX2q6eNFxNglXyP6S3I2ohmNFsiSpgHfAbYG1gN2lrRek8eMiLFJvkb0l+RsRHOaPoP8SuDvtq+z/ThwAvCWho8ZEWOTfI3oL8nZiIY0XSCvAsytzd9clkVE70m+RvSX5GxEQ7p6k14TJO0O7F5mH5R0TZvxxNNWAu5sO4h+pK+9b6J3ucZE73A8krM9Kzk7BsnXaEnydYwmK2ebLpBvAVarza9alj3N9mHAYQ3HEV2SNMf2rLbjiEk1ar5CcrZXJWenpPSxfSr52vuaHmJxIfAiSWtKWhzYCTit4WNGxNgkXyP6S3I2oiGNnkG2/aSkjwK/AaYBR9q+ssljRsTYJF8j+ktyNqI5st12DNGDJO1eLs1FRB9Izkb0j+Rr70uBHBERERFRk1dNR0RERETUpECOiIiIiKhJgRwRERERUZMCOYYk6di2Y4iIoUl6paRXlM/rSdpH0jZtxxURsbBo/U160T5Jg5+bKeBfJT0LwPabJz2oiBiSpP2BrYFFJZ0JbAz8DviUpA1tf6nVACNiAZKWB/YD3go8BzBwB/Az4Cu2720tuBhWnmIRSLoYuAo4gipxBfyY6qHz2P59e9FFRJ2ky4GXAUsA/wBWtX2/pKWA821v0GZ8EbEgSb8BzgaOsf2Psux5wPuA19l+Q5vxxdAyxCIAZgEXAZ8B7rM9G3jE9u9THEf0nCdtz7f9MHCt7fsBbD8CPNVuaBExhJm2vzpQHAPY/oftrwJrtBhXjCBDLALbTwFfl3Ry+Xo7+b8R0asel7R0KZA3GlhYLuOmQI7oPTdK+gTVGeTbASQ9F9gVmNtmYDG8nEGOp9m+2fYOwK+A49qOJyKG9JpSHA/8cTtgMapLthHRW3YEVgR+L+keSXcDs4EVgHe2GVgML2OQIyIiIhokaV1gVeDPth+sLd/K9q/biyyGkzPIEREREQ2RtCfVEys+Clwh6S211Qe1E1WMJuNMIyIiIprzQWAj2w9KmgmcImmm7UOpnhoV
PShnkKcASed1uf3mkn7RVDwRMbLkbMRCZZGBYRW2bwA2B7aWdAgpkHtWCuQpwPYmbccQEZ1LzkYsVG6X9LKBmVIsbwusBLykraBiZCmQpwBJD5avm0uaLekUSf8r6XhJKuu2KssuBravtV1G0pGSLpB0ycDYKUmHSvpc+fxGSX+QlP9PERMgORuxUHkv1Ut9nmb7SdvvBV7TTkgxmoxBnno2BNYHbgXOBTaVNAc4HNgC+DtwYm37zwBn235/efX0BZJ+S/XazAsl/RH4JrDNoEdORcTESM5G9DHbN4+w7tzJjCU6l7MHU88F5XnHTwGXAjOBdYHrbf/N1XP/6s9AfgPwKUmXUj23cUlg9fIc1g8CZwLftn3tpH0HEVNLcjYiYpLlDPLU81jt83xG/z8g4O22rxli3UuAu4CVJyi2iHim5GxExCTLGeQA+F9gpqS1yvzOtXW/AfaojXvcsHxdA/gY1eXfrSVtPInxRkx1ydmIHpGnziycUiAHth8FdgdOLzf83FFbfSDVK2wvk3QlcGDpeH8A7Gv7VuADwBGSlpzk0COmpORsRO/IU2cWTnnVdERERMQYSXrQ9rKSNgcOAO4EXgxcBLzbtiVtBXwDeBg4B3iB7W0lLQN8q2y/GHCA7Z9JOhS4y/YXJL2R6ubbzXNj7eTJGOSIiIiIiZGnziwkMsQiIiIiYmLkqTMLiZxBjoiIiJgYeerMQiJnkCMiIiKak6fO9KEUyBERERENyVNn+lOeYhERERERUZMzyBERERERNSmQIyIiIiJqUiBHRERERNSkQI6IiIiIqEmBHBERERFRkwI5IiIiIqImBXJERERERE0K5IiIiIiImhTIERERERE1KZAjIiIiImpSIEdERERE1KRAjoiIiIioSYHcgyTtKumc2vyDkl4wSpuZkixp0eYjjIiR9GsOS9pF0hltHT+iE72cX5I2lfS3EtNbmzxWNCvFVB+wvWzbMUwUSTOB64HFbD/ZcjgRk6Jfctj28cDxbccR0Y0ey68vAN+2fWjbgQwm6QDghbbf3XYs/SBnkKPn5Cx4xNgkdyKa02F+rQFcOVH7lzRtvPuIsUmB3DJJq0k6VdI8SXdJ+vYQ21jSC8vnpSQdLOlGSfdJOkfSUkO0ebukGyS9eJTjbybpPEn3SporadeyfHlJx5a4bpT0WUmLlHUHSDquto8FLl1Jmi3pQEnnSnpA0hmSViqb/6F8vbdcgnp1uVx2rqSvS7oL+IKkuyW9pHaM50h6WNKMbn6+EU1rM4drufcBSTcBZ5fl75d0taR7JP1G0hq1Nm+QdE059ncl/V7Sv5V1gy9dbyLpwrLthZI2qa0bKc8jJkQ/5Zeka4EXAD8v/dsSpS/9gaTbJN0i6YsqRe8Qfd8Bko6W9D1Jv5T0EPCvklaW9JPyM7he0p61GA+QdIqk4yTdD+w6zPeyFfBpYMcS218k7SDpokHb7SPpZ+Xz0ZL+R9KZJcd/P+h3ybpl3d3ld8o7h/tZ9qMUyC0qSfIL4EZgJrAKcMIozb4GbARsAqwAfAJ4atB+dwO+Cmxp+4oRjr8G8CvgW8AM4GXApWX1t4DlqZL9tcB7gd06/NYA3lW2fw6wOLBvWf6a8vVZtpe1/acyvzFwHfBc4ECqn0P9MtDOwFm253URQ0Sj2s7hmtcC/wK8UdJbqDrC7any+o/Aj8t+VwJOAfYDVgSuKXEM9b2tAJwOfLNsewhwuqQVa5sNl+cR49Zv+WV7LeAmYLvSvz0GHA08CbwQ2BB4A/BvtX3X+74vlWXvKp+nA+cBPwf+Ur7/1wF7S3pjbR9vocrrZzHMECnbvwYOAk4ssb0UOA1YU9K/1DZ9D3BsbX4Xqj55Jar64HgAScsAZwI/osr/nYDvSlpvyJ9gP7KdqaUJeDUwD1h00PJdgXNq86ZKrkWAR4CXDrGvmWW7fYGrgFU7OP5+wE+HWD4NeBxYr7bs34HZ5fMBwHFDHHvRMj8b+Gxt/YeBXw+1be37vWlQ
DBtT/aJRmZ8DvLPtf7NMmepTD+TwQJsX1Jb9CvhAbX4R4GGqS7/vBf5UWydgLvBvg+Om6igvGHS8PwG7ls/D5nmmTBMx9Vt+lfkbqApvqIrex4ClatvvDPyu9n0M7vuOBo6tzW88xDb7AUeVzwcAf+jw53kAtb67LPse8KXyeX3gHmCJWiwn1LZdFpgPrAbsCPxx0L6+D+zf9v+biZpyBrldqwE3uvOb1VYClgSuHWGbjwPfsX1zh8cfal8rAYtR/dU+4Eaqv1479Y/a54epEmskc+szts8v7TaXtC7VL7/Tujh+xGRoO4cH1PNnDeBQVcOm7gXupiqEVwFWrm/rqlcb7jgrs+DvAHjm74Fu8zyiG/2WX4OtQdWX3lbb/vtUZ1yH2vdwx1t5oH3Zx6epiu+R9tGpY4B3SRLVH8UnuTrz/Yx9236Q6vtducS18aC4dgGeN45YekoGc7drLrC6pEU7/AVwJ/AosBbV5ZahvAH4taR/2P5JB8d/5TDHeYIqAa4qy1YHbimfHwKWrm3fTUK4i+XHUA2z+Adwiu1HuzhOxGRoO4cH1PNnLtUZoWdcapX0ImDV2rzq84PcSvU7oG514NcdxhQxXn2VX0OYS3UGeaUR4h+q7xt8vOttv6jD+EbyjO1s/1nS48D/oRra8a5Bm6w28EHSslTDVm4tcf3e9us7PHbfyRnkdl0A3AZ8RdIykpaUtOlwG9t+CjgSOKQM2p+m6ia3JWqbXQlsBXxH0ptHOf7xwJaS3ilpUUkrSnqZ7fnAScCXJE0vY5X3AQZuzLsUeI2k1SUtT3W5p1PzqMaDjfjMyuI44G1URfKxo2wb0Ya2c3go/wPsJ2l9ePqG2x3KutOBl0h6q6qbaj/C8H/g/hJYW9K7yu+HHYH1qMaERkyGfsuvwfHcBpwBHCxpOUmLSFpL0mu7ON4FwAOSPqnqBsRpkl4s6RVjiP12YKbKDfc1xwLfBp6wfc6gdduoupl/caqxyH+2PZfq98Dakt4jabEyvWLQeOa+lgK5RaUQ3Y5q+MBNVJc6dxyl2b7A5cCFVJc6vsqgf0fbfwG2BQ6XtPUIx78J2Ab4WNnXpcBLy+o9qM4UXwecQzUQ/8jS7kzgROAy4CK66DBtP0x188G55bLMq0bYdi5wMdVfvX/s9BgRk6XtHB4mpp+WfZ6g6q72K4Cty7o7gR2A/wLuoip451Cd5Rq8n7tKDB8r234C2LbsI6Jx/ZZfw3gv1Q2sV1GN7z0FeH4Xx5tfYn0Z1TsE7gSOoLqJvlsnl693Sbq4tvyHwIv550mwuh8B+1P9LDei3Dxv+wGqs/E7UZ1R/gfVz2WJIfbRlwZugIroSZKOBG61/dm2Y4lY2JQzSTcDu9j+XdvxRMTkU/UYvDuAl9v+W2350cDNU7X/zRjk6Fmq3rq3PdWjcSJiApTHQ51Pdbf/x6luMPpzq0FFRJv+A7iwXhxHhlgs9CTtouqh4IOnMb3pZ7JIOpDq0tV/276+7Xgi2tJADr+a6i7/O6kuX7/V9iMTFnBEH+nXPnI4kn41zPfz6WG2vwHYi2ooVdRkiEVERERERE3OIEdERERE1PTUGOSVVlrJM2fObDuMiJ5y0UUX3Wl7RttxDCU5G7Gg5GtEfxkuZ3uqQJ45cyZz5sxpO4yIniJp8NvMekZyNmJBydeI/jJczmaIRURERERETQrkiIiIiIiaFMgRERERETU9NQZ5om308WPbDiGmoIv++71th9C3krMx2ZKvY5d8jTZMVs7mDHJERERERE0K5IiIiIiImhTIERERERE1KZAjIiIiImpSIEdERERE1KRAjoiIiIioSYEcEREREVGTAjkiIiIioqbxAlnSf0q6UtIVkn4sacmmjxkRY5N8jZh8ktaWdJakK8r8BpI+22Hb5GxEAxotkCWtAuwJzLL9YmAasFOTx4yIsUm+RrTmcGA/4AkA25fRQe4lZyOaMxlDLBYFlpK0KLA0cOskHDMixib5GjH5lrZ9waBl
T3bYNjkb0YBGC2TbtwBfA24CbgPus31GfRtJu0uaI2nOvHnzmgwnIkbQSb5CcjaiAXdKWgswgKR3UOXgiNLHRjSn6SEWzwbeAqwJrAwsI+nd9W1sH2Z7lu1ZM2bMaDKciBhBJ/kKydmIBnwE+D6wrqRbgL2BD43WKH1sRHOaHmKxJXC97Xm2nwBOBTZp+JgRMTbJ14gW2L7O9pbADGBd25vZvrGDpsnZiIY0XSDfBLxK0tKSBLwOuLrhY0bE2CRfI1ogaUVJ3wT+CMyWdKikFTtompyNaEjTY5DPB04BLgYuL8c7rMljRsTYJF8jWnMCMA94O/CO8vnE0RolZyOas2jTB7C9P7B/08eJiPFLvka04vm2D6zNf1HSjp00TM5GNCNv0ouIiGjXGZJ2krRImd4J/KbtoCKmshTIERER7fog8CPgsTKdAPy7pAck3d9qZBFTVONDLCIiImJ4tqe3HUNELChnkCMiIlok6SeStpGUPjmiR3ScjJJe0mQgERERU9T3gF2Av0n6iqR12g4oYqrr5q/V70q6QNKHJS3fWEQRERFTiO3f2t4FeDlwA/BbSedJ2k3SYu1GFzE1dVwg2/4/VH/hrgZcJOlHkl7fWGQRERFTRHkxyK7AvwGXAIdSFcxnthhWxJTV1U16tv8m6bPAHOCbwIbl7T2ftn1qEwFGxNhIWgVYg1qe2/5DexFFxFAk/RRYB/ghsJ3t28qqEyXNaS+yiKmr4wJZ0gbAbsCbqP6i3c72xZJWBv5E9Q74iOgBkr4K7AhcBcwviw2kQI7oPYfb/mV9gaQlbD9me1ZbQUVMZd2cQf4WcATV2eJHBhbavrWcVY6I3vFWYB3bj7UdSESM6ovALwct+xPVEIuIaEFHBbKkacAttn841PrhlkdEa64DFqN66UBE9CBJzwNWAZaStCGgsmo5YOnWAouIzgpk2/MlrSZpcduPNx1URIzbw8Clks6iViTb3rO9kCJikDdS3Zi3KnAw/yyQ7wc+3VJMEUF3QyyuB86VdBrw0MBC24dMeFQRMV6nlSkiepTtY4BjJL3d9k+G207S+8q2ETFJuimQry3TIsDAazE94RFFxLjZPkbS4sDaZdE1tp9oM6aIGNpIxXGxF5ACOWISdVMgX2X75PoCSTtMcDwRMQEkbU7Vod5Addl2tXIWKk+xiOg/Gn2TiJhI3bxJb78Oly1A0rMknSLpfyVdLenVXRwzIsbmYOANtl9r+zVUYx2/Plqj5GtETxr2am1yNqIZo55BlrQ1sA2wiqRv1lYtBzzZwTEOBX5t+x3lkm/uzI1o3mK2rxmYsf3XDl9Zm3yN6D0jnUFOzkY0oJMhFrdSvTnvzcBFteUPAP85UkNJywOvobpLl/IEjDwFI6J5cyQdARxX5nehyuNhJV8jJp+kRYB32D5phM3OHaZtcjaiIaMWyLb/AvxF0o/GcJPPmsA84ChJL6UqsPey/dDIzSJinP4D+Agw8Fi3PwLfHaVN8jViktl+StIngGELZNsfHWZVcjaiId2MQX6lpDMl/VXSdZKul3TdKG0WpXoT0Pdsb0j1eLhP1TeQtLukOZLmzJs3r7voI2JI5RW1h9jevkxf7+CteqPmKyRnIxrwW0n7lvcNrDAwddAufWxEQ7opkH8AHAJsBrwCmFW+juRm4Gbb55f5Uxj06kzbh9meZXvWjBkzuggnIgaTdFL5ermkywZPozQfNV8hORvRgB2prvj8geos8EWMMiSqSB8b0ZBuHvN2n+1fdbNz2/+QNFfSOuWGodcBV3UVYUR0Y6/yddtuGyZfI9phe80xtkvORjSkmwL5d5L+GziVBV9de/Eo7fYAji93114H7NZ1lBHREdu3la83jnEXydeISSZpaWAfYHXbu0t6EbCO7V900Dw5G9GAbgrkjcvXWbVlBrYYqZHtSwe1iYiGSXqAZz479T6qy7Yfsz3k/QPJ14hWHEU1rGKTMn8LcDIwaoGcnI1oRscFsu1/bTKQiJhQ36Aa
n/gjqmeo7gSsBVwMHAls3lZgEfEMa9neUdLOALYflpS350W0qOMCWdLnhlpu+wsTF05ETJA3235pbf4wSZfa/qSkT7cWVUQM5XFJS1Gu+khai9pQxoiYfN08xeKh2jQf2BqY2UBMETF+D0t6p6RFyvRO4NGybtjX1kZEK/YHfg2sJul44CzgE+2GFDG1dTPE4uD6vKSvAb+Z8IgiYiLsQvUK2u9SFcR/Bt5dzlIN99KBiJhk5U16zwa2B15FNSRqL9t3thpYxBTXzU16gy0NrDpRgUTExCk34W03zOpzJjOWiBjewJv0yqumT287noiodDzEYtCLB64ErqG6ESgieoyktSWdJemKMr+BpM+2HVdEDGmsb9KLiIZ0cwa5/uKBJ4HbbT85wfFExMQ4HPg48H0A25dJ+hHwxVajioih7Fi+fqS2zMALWoglIuhuDPKNkl4K/J+y6A/AaK+ujYh2LG37gkFPisoftBE9poxB/pTtE9uOJSL+qZshFnsBxwPPKdPxkvZoKrCIGJc7y6OiBh4b9Q7gtnZDiojBbD9FdbUnInpIN0MsPgBsbPshAElfBf4EfKuJwCJiXD4CHAasK+kW4HqqJ1tERO/5raR9gROpHqUKgO272wspYmrrpkAW1fOPB8wvyyKih0iaBnzY9paSlgEWsf1A23FFxLAyBjmix3RTIB8FnC/pp2X+rcAPJjyiiBgX2/MlbVY+PzTa9hHRLttrth1DRCyom5v0DpE0G9isLNrN9iWNRBUR43WJpNOAk1nwku2p7YUUEUORtDSwD7C67d0lvQhYx/YvWg4tYsrquECW9CrgStsXl/nlJG1s+/zGoouIsVoSuAvYorbMQArkiN5zFHARsEmZv4Xqj9sUyBEt6WaIxfeAl9fmHxxiWUT0ANu7jbRe0n62vzxZ8UTEiNayvaOknQFsP6xBz2iMiMnV8WPeANn2wEx5NM14XlUdEe3Zoe0AIuJpj0tain8+lnEt4LF2Q4qY2ropkK+TtKekxcq0F3BdJw0lTZN0iaRcLoroDcOenUq+Rky6/YFfA6tJOh44C/hEJw2TrxHN6KZA/hDV+KhbgJuBjYHdO2y7F3B1d6FFRIM8wrrka8QkkLRp+fgHYHtgV+DHwCzbszvcTfI1ogEdF8i277C9k+3n2H6u7XfZvmNgvaT9hmonaVXgTcAR4w83IibIkGeQk68Rk+qb5eufbN9l+3Tbv7B9ZyeNk68RzZnIMcQ7AEPd9PMNqktF04dqJGl3ypno1VdffQLDiZi6JK0w+C1ckta0fX2ZPXmYpt9ghHwt+0nORkyMJyQdBqwq6ZuDV9rec5T23yD5GtGIboZYjOYZZ6QkbQvcYfui4RrZPsz2LNuzZsyYMYHhRExpP5e03MCMpPWAnw/M2z5ocINO8rW0Tc5GTIxtgbOBR6ge8zZ4GlbyNaJZE3kGeagxjZsCb5a0DdVzWZeTdJztd0/gcSPimQ6iKpLfBKwDHAvsMkqb5GvEJCpDKU6QdLXtv3TZPPka0aBGzyDb3s/2qrZnAjsBZyd5I5pn+3Tg68AZwNHA22xfOkqb5GtEOx6RdJakKwAkbSDpsyM1SL5GNKubN+mNdUxjREwSSd9iwas5ywPXAh+V1MmYxoiYfIcDHwe+D2D7Mkk/Ar7YalQRU1g3Qyx+Lmlr2/fD02MaTwJeDEOPaawrj6yZPbYwI6JDcwbNjzg+cTjJ14hJtbTtCwa9PO/JThsnXyMmXjcF8ljGNEbEJLJ9DICkZYBHbc8v89OAJdqMLSKGdWd5e97Am/TeAdzWbkgRU1vHBbLt0yUtRjWmcTrVmMa/NhZZRIzHWcCWwINlfimq3N2ktYgiYjgfAQ4D1pV0C3A9OQEV0apRC+SMaYzoS0vaHiiOsf2gpKXbDCginqlc3fmw7S3LlZ9FbD/QdlwRU10nZ5AnZExjREyqhyS93PbFAJI2onrWakT0ENvzJW1WPj/UdjwRURm1QM6Yxoi+tDdwsqRb
qR7B+Dxgx1YjiojhXCLpNKqnQT1dJNs+tb2QIqa2bm7Sy5jGiD5h+0JJ61LdUAtwje0n2owpIoa1JHAXsEVtmYEUyBEt6aZAzpjGiB4naQvbZ0vaftCqtcs9A+lwI3qM7d1GWi9pP9tfnqx4IqK7AjljGiN632uBs4HthliXM1IR/WkHIAVyxCTqpkDem4xpjOhptvcvX0c8IxURfUWjbxIRE6mb5yBnTGNEj5O0z0jrbR8yWbFExITx6JtExETq5DnIGdMY0T+mj7AunWxEf8oZ5IhJ1skZ5IxpjOgTtj8PIOkYYC/b95b5ZwMHtxhaRAxD0gq27x60bE3b15fZk1sIK2JK6+Q5yBnTGNF/NhgojgFs3yNpwxbjiYjh/VzS1rbvB5C0HnAS8GIA2we1GVzEVNTJEIuMaYzoP4tIerbte6A6Q0V3N+VGxOQ5iKpIfhPVfT7HAru0G1LE1NZJh5kxjRH952DgT5IGLs3uAHypxXgiYhi2T5e0GNXLt6YDb7P915bDipjSOhliMeYxjZJWo/pL+LlUxfRhtg8dZ8wRMQrbx0qawz/fzLW97atGapN8jZhckr7FgiealgeuBT5aboLfc5T2ydmIhnRzyXUsYxqfBD5m+2JJ04GLJJ05WkcdEeNX8qybXEu+RkyuOYPmL+qyfXI2oiHdFMhdj2m0fRtwW/n8gKSrgVXortOOiEmQfI2YXLaPAZC0DPCo7fllfhqwRAftk7MRDVmki20HxjQeKOlA4DzgvzptLGkmsCFw/qDlu0uaI2nOvHnzuggnIpoyXL6WdcnZiIl1FrBUbX4p4Lfd7CB9bMTE6rhAtn0ssD1we5m2t/3DTtpKWhb4CbD3wGNsavs9zPYs27NmzJjReeQR0YiR8hWSsxENWNL2gwMz5fPSnTZOHxsx8bp67NMYxjRS7sz9CXB83roX0duSrxGteEjSy21fDCBpI+CRThomZyOa0ehzUSUJ+AFwdZ6XHNHbkq8RrdkbOFnSrVSvlX4esONojZKzEc1p+sUBmwLvAS6XdGlZ9mnbv2z4uBHRveRrRAtsXyhpXaqXhABcY/uJDpomZyMa0miBbPscqr+GI6LHJV8jJpekLWyfLWn7QavWLs9BHnHIRHI2ojl59WxEREQ7XgucDWw3xDoDGVMc0ZIUyBERES2wvX/5ulvbsUTEglIgR0REtEDSPiOtz413Ee1JgRwREdGO6SOs86RFERHPkAI5IiKiBbY/DyDpGGAv2/eW+WdTvb02IlrSzaumIyIiYuJtMFAcA9i+h+q10RHRkhTIERER7VqknDUGQNIK5ApvRKuSgBEREe06GPiTpJPL/A7Al1qMJ2LKS4EcERHRItvHSpoDbFEWbW/7qjZjipjqUiBHRES0rBTEKYojekTGIEdERERE1KRAjoiIiIioSYEcEREREVGTAjkiIiIioiYFckRERERETeMFsqStJF0j6e+SPtX08SJi7JKvEf0lORvRjEYLZEnTgO8AWwPrATtLWq/JY0bE2CRfI/pLcjaiOU2fQX4l8Hfb19l+HDgBeEvDx4yIsUm+RvSX5GxEQ5p+UcgqwNza/M3AxvUNJO0O7F5mH5R0TcMxRWdWAu5sO4h+pK+9b6J3ucZE73AYo+YrJGd7WHJ2DPo4XyF9bD9Lvo7RZOVs62/Ss30YcFjbccSCJM2xPavtOKL3JGd7U3I2hpJ87U3J197X9BCLW4DVavOrlmUR0XuSrxH9JTkb0ZCmC+QLgRdJWlPS4sBOwGkNHzMixib5GtFfkrMRDWl0iIXtJyV9FPgNMA040vaVTR4zJkwuyU0xyde+l5ydYpKzfS352uNku+0YIiIiIiJ6Rt6kFxERERFRkwI5IiIiIqImBXJERERERE0K5IiIiIgGSVpX0uskLTto+VZtxRQjS4EcI5K0W9sxRERE9CtJewI/A/YArpBUfx34Qe1EFaPJUyxiRJJusr1623FExMgkrWj7rrbjiIgFSboceLXtByXN
BE4Bfmj7UEmX2N6w3QhjKK2/ajraJ+my4VYBz53MWCJidJK+AnzN9p2SZgEnAU9JWgx4r+3ftxthRNQsYvtBANs3SNocOEXSGlT9bPSgnEEOJN0OvBG4Z/Aq4DzbK09+VBExHEmX235J+fw74BO2L5S0NvAj27PajTAiBkg6G9jH9qW1ZYsCRwK72J7WVmwxvJxBDoBfAMvWk3eApNmTHk1EjGZRSYvafhJYyvaFALb/KmmJlmOLiAW9F3iyvqDk7nslfb+dkGI0OYMcEdFnJO0BbAd8BXgN8GzgVGAL4AW239NieBERfS8FckREHyrjGP8DWJvqauBc4P8BR5azUxERMUYpkCMiFiKSdrN9VNtxRET0szwHeQqQdF6X228u6RdNxRMRjfp82wFETCXpYxdOuUlvCrC9SdsxRMTEyaMZI3pH+tiFU84gTwGSHixfN5c0W9Ipkv5X0vGSVNZtVZZdDGxfa7uMpCMlXSDpkoE3AEk6VNLnyuc3SvqDpPx/ipgcz6W6M367Iaa8LCRiEqWPXTjlDPLUsyGwPnArcC6wqaQ5wOFUd8D/HTixtv1ngLNtv1/Ss4ALJP0W2A+4UNIfgW8C29h+avK+jYgpLY9mjOhN6WMXEvlrZOq5wPbNJdEuBWYC6wLX2/6bq7s2j6tt/wbgU5IuBWYDSwKr234Y+CBwJvBt29dO2ncQMcXZ/oDtc4ZZ967JjicinpY+diGRM8hTz2O1z/MZ/f+AgLfbvmaIdS+hupybN+1FRESkj11o5AxyAPwvMFPSWmV+59q63wB71MZRbVi+rgF8jOpy0taSNp7EeCMiIvpF+tg+lAI5sP0osDtwermB4I7a6gOBxYDLJF0JHFgS+QfAvrZvBT4AHCFpyUkOPWKhlMdGRSw80sf2p7woJCKiz5W36u1re9uWQ4mIWCjkDHJERI/JY6MiItqVm/QiInpbHhsVETHJcvYgIqK35bFRERGTLGeQIyJ6Wx4bFRExyXIGOSKi/+SxURERDUqBHBHRZ/LYqIiIZuUxbxERERERNTmDHBERERFRkwI5IiIiIqImBXJERERERE0K5IiIiIiImhTIERERERE1KZAjIiIiImpSIEdERERE1Px/S5Yr7DgjTVUAAAAASUVORK5CYII=\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:30:20.899771Z",
+ "start_time": "2020-11-13T15:30:20.750817Z"
+ }
+ },
+ "outputs": [],
+ "source": [
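+    "# Merge the training and test click logs\n",
+    "# (DataFrame.append was removed in pandas 2.0; this assumes pandas is imported as pd)\n",
+    "user_click_merge = pd.concat([trn_click, tst_click])"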
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:30:26.290038Z",
+ "start_time": "2020-11-13T15:30:25.339579Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id click_article_id count\n",
+ "0 0 30760 1\n",
+ "1 0 157507 1\n",
+ "2 1 63746 1\n",
+ "3 1 289197 1\n",
+ "4 2 36162 1\n",
+ "5 2 168401 1\n",
+ "6 3 36162 1\n",
+ "7 3 50644 1\n",
+ "8 4 39894 1\n",
+ "9 4 42567 1"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+    "# Count how many times each user clicked each article\n",
+    "user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg(['count']).reset_index()\n",
+ "user_click_count[:10]"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:34:27.418638Z",
+ "start_time": "2020-11-13T15:34:27.372761Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id click_article_id count\n",
+ "311242 86295 74254 10\n",
+ "311243 86295 76268 10\n",
+ "393761 103237 205948 10\n",
+ "393763 103237 235689 10\n",
+ "576902 134850 69463 13"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_count[user_click_count['count']>7]"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:32:53.298575Z",
+ "start_time": "2020-11-13T15:32:53.285611Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([ 1, 2, 4, 3, 6, 5, 10, 7, 13])"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_count['count'].unique()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1 1605541\n",
+ "2 11621\n",
+ "3 422\n",
+ "4 77\n",
+ "5 26\n",
+ "6 12\n",
+ "10 4\n",
+ "7 3\n",
+ "13 1\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+    "# Distribution of click counts per (user, article) pair\n",
+    "user_click_count.loc[:,'count'].value_counts()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
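+    "###### Observation: 1,605,541 (user, article) pairs (about 99.2%) were clicked exactly once; only a small fraction involve repeat clicks on the same article. The repeat-click count could also be used as a standalone feature."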
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtIAAAFgCAYAAACWgJ5JAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAABBAUlEQVR4nO3deZgkVZn+/e/NIrvsgizdrYgwoAjSgtuMiAiCIIogIOOCKO7CT3FBfQVBGZ0RHREVUVYFWRRGFAZBkGFVaBAQcBh22WTfN6G53z/ilCRJVlVmVmZGZvX9ua64MiPiZMRT1f1UPBlx4oRsExERERERnZmv7gAiIiIiIkZRCumIiIiIiC6kkI6IiIiI6EIK6YiIiIiILqSQjoiIiIjoQgrpiIiIiIgupJCOiOiCpPdLOrdh/mFJL57kM7MkWdICU9z3jZI2mco2ynYmjTkiIsaXQnqI1Hlg7pUcmGNeZXtx29fXHUcnehWzpDdL+r2khyTdI+lSSZ+XtHAv4owYddPh+B6tpZAeYvPygblOkjaSdEvdcUSMAknbAb8AjgZm2l4W2B5YBVh1nM+kMIh52nQ4VkYlhXQMjKT5644hohuSVpV0gqS7yhnXA1u0saSXlPeLSNpf0k2SHpB0rqRFWnzmnaWbxssm2f97yrbukfSlpnXzSfqCpOvK+uMkLVPW/bekTzS1v0zSNp3ELOnVks6XdH/5/EZluYBvA/vY/rHtewFsX237k7avKe32lvQLST+T9CDwfkkrSTpJ0r2SrpX0oYYYD5f0tYb5Z325Lb+zPSVdJek+SYfl7HdE1CGFdE2G4MDc8sBY1p0laV9J55VLtadJWq6s6+TAfLikH0o6RdIjwBsl/VPZ/v2SrpT0tobtHC7p+5JOLvv9o6TVmn4fH5N0TVm/r6TVys/xYCkgntfQfktVl5jvL23WaVh3o6Q9JF1efp/HSlpY0mLAfwMrqbr09rCklSb6Xcb0puoL4G+Am4BZwMrAMZN87FvA+sBrgWWAzwFPN213Z+CbwCa2r5hg/2sBPwTeA6wELEt1tnfMJ4G3A28o6+8Dvl/W/RzYsWlbM4GT241Z0sql/dfK8j2AX0paHlijxPLLCX4XY7amOnO9FHAU1e/wlhLztsB+kjZuYztjdgI2A1YDXgp8uYPPRvTNEBzf31aOr/eX4+0/Naz7vKRbyzH0aklv6sXPPE+znWnAEzA/cBnwHWAxYGHg9cD7gXMb2hl4SXn/feAsqoP4/FQHu4WoDuwGFgB2Bq4d+8wE+18ZuAfYgurL1JvL/PJl/VnAdVQHp0XK/DfKuvcC5zVsay3gfmChFjEfDjwAvK7sZ4kS3xeB5wEbAw8BazS0vwfYoPw8RwHHNP0+fgU8H1gbeAI4A3gxsCRwFfC+0nY94E5gw/L7eh9wY0OcNwIXUh3ElwH+AnykrNsIuKXu/yeZhmMCXgPcBSzQtLxlvpb/648Br2ixrbF83aP8f12ljf1/pSkPFgP+TlWAU/7vvqlh/QuBJ0sOLQE8QtXlAuDrwKEdxvx54KdNy35bcur1ZRsLN6w7pvxNeBR4T1m2N3B2Q5tVgbnAEg3L/g04vLw/HPhaw7pn5WTJ3480zG8BXFf3/5VMmaj/+P7SkvNvBhak+kJ8LdUxdw3gZmCl0nYWsFrdv7NRn3JGuh4bUBVwn7X9iO3HbZ87XmNJ8wEfAHazfavtubbPt/1EQ7Pdgc8CG9m+dpL9/ytwiu1TbD9t+3RgDtXBaMxhtv/P9mPAccC6ZfmJwLqSZpb5nYATmmJp9Cvb59l+umxjcaqi/O+2z6Q607djQ/sTbV9o+ymqQnrdpu39u+0HbV8JXAGcZvt62w9QnUler7TbFfiR7T+W39cRVIX3qxu2dYDt21xdjv51i31FQFX03VT+T7ZjOaqD53UTtPks8H3b7fTFX4nq4AeA7UeovnCOmQmcWM4+3U9VWM8FVrD9ENXZ5B1K2x2p8qqT
mGcC241tv+zj9VQF+1gcL2yIbwfbSwGXUBUFY25ueL8ScG+Jb8xNVIVEuxq3d1PZZkTd6j6+bw+cbPt0209SXWlahKo4n0tVoK8laUHbN9qe6O9UtCGFdD3qPjBPdGAc87eG949SFcB0cGAe03zwvLkU1WOaD54t99vgjob3j7WYH2s/E/hM08+4Ks8+2E62rwio/g/PUPs3yN0NPE7V5WA8mwJflvTONrZ3Ow037UlalKp7R2N8m9teqmFa2PatZf3PgR0lvYbq78jvO4z5Zqoz0o3bX8z2N4CrgVuBbdr4Odzw/jZgGUlLNCybUbYF1Rm1RRvWrdhie403Ms4o24yoW93H95WojqsAlOPtzcDKpQjfneoK0Z2SjknXxalLIV2Pug/MEx0Y29HOgXlM88Fz1fINfEzjwbOXbga+3vQzLmr752181pM3iXnIhVTF7DckLVb60r9uvMblwHUo8G1VN9TNL+k1khZqaHYl8Bbg+433CYzjF8CWkl5f7gHYh2f/7T4I+PrYVSJJy0vaumH9KVRfLPcBjm36IttOzD8DtpK0WVm+sKqb/1Ypn/sMsJekD0laWpXVgRUm+B3dDJwP/FvZ3jrALmVfAJcCW0haRtKKVAf/Zh+XtIqqGyu/BBw70S8xYkDqPr7fRpXvwD9uCF6Vcpy1fbTt15c2prpPI6YghXQ96j4wj3tgbDP+SQ/M4/gj1Znfz0laUNUNjlsx+Y1b3fgx8BFJG5YD+2KS3tp0Bmw8dwDLSlqyD3HFiLE9l+r/6UuAv1LdILf9JB/bA/gzcBFwL9XB6ll/b21fBmwJ/FjS5hPs/0rg41TDy91OdTNh45mp7wInAadJegj4A9W9AWOffwI4AdikbKOjmEvRuzXVvQ13URUKnx37eWwfC7yLqsvYzVSFwXHAwcDxE+xvR6o+mrdRdRnby/bvyrqfUvUzvRE4jdZF8tFl3fVUZ/O+1qJNxKDVfXw/DnirpDdJWpDqi+4TwPmS1pC0cdn241RXcds9fsd46uqcPa9PVGdi/4uqj+HdwAFMfDPCIsB/Un2rfAA4uyybVdotUNrNpioEN59k/xsC/0N1wLyLqrvGjLLuLOCDDW2fFVdZdkjZ76ualjffbPi1pvVrl/0+QHWz1Tsa1j2rPc+9wegf2y7z5wLvb5j/GvCThvm3UBUF91P9YTuecnMT1QF6k4a2ewM/a5g/tPzb3E+5MSNTpkzDMTXnb6ZMwzQNwfH9HeX4+kA53q5dlq9DVeg/VI79v8nxbeqTyi83IiJiJEi6kerL/u8maxsR0U/p2hERUTNJO+mZccsbpyvrji0iIsaXM9LTlKSdgB+1WHWT7bUHHU9ERERMXY7vwyWFdEREREREF9odnmUkLLfccp41a1bdYUQM3MUXX3y37eXrjqNTydmYV41iziZfY141Ub5Oq0J61qxZzJkzp+4wIgZO0k2Ttxo+ydmYV41iziZfY141Ub7mZsOIiIiIiC70tZCWdKikOyVd0bBsGUmnS7qmvC49zmffV9pcI+l9/YwzIirJ2YjRkXyNqF+/z0gfTvVQjEZfAM6wvTpwRpl/lvLI172oHhqyAdXjZ1v+MYiInjqc5GzEqDic5GtErfpaSNs+m+rpOY22Bo4o748A3t7io5sBp9u+1/Z9wOk8949FRPRYcjZidCRfI+pXx82GK9i+vbz/G7BCizYrAzc3zN9Slj2HpF2BXQFmzJjRsyDX/+yRPdtWRCsX/8d76w6hXcnZCEYmZ5OvEQwuX2u92dDVINZTGsja9sG2Z9uevfzyIzWSUMTISc5GjI7ka0T/1VFI3yHphQDl9c4WbW4FVm2YX6Usi4jBS85GjI7ka8QA1VFInwSM3SH8PuBXLdr8FthU0tLlBohNy7KIGLzkbMToSL5GDFC/h7/7OXABsIakWyTtAnwDeLOka4BNyjySZkv6CYDte4F9gYvKtE9ZFhF9lJyNGB3J14j69fVmQ9s7jrPqTS3azgE+2DB/
KHBon0KLiBaSsxGjI/kaUb882TAiIiIiogsppCMiIiIiupBCOiIiIiKiC20X0pJ+2s6yiBgekp4vaYm644iIiJiOOjkjvXbjjKT5gfV7G05E9IKkV0n6M3A5cIWkyyQlXyMiInpo0kJa0p6SHgLWkfRgmR6iGuS91fiUEVG/Q4CP2Z5leybwceCwmmOKiIiYViYtpG3/m+0lgP+w/fwyLWF7Wdt7DiDGiOjcXNvnjM3YPhd4qsZ4IiIipp22x5G2vaeklYGZjZ+zfXY/AouIKfkfST8Cfg4Y2B44S9IrAWxfUmdwERER00HbhbSkbwA7AFcBc8tiAymkI4bPK8rrXk3L16PK240HG05ERMT008mTDd8BrGH7iX4FExG9YfuNdccQEREx3XVSSF8PLAikkI4YcpK+0mq57X0GHUtERMR01Ukh/ShwqaQzaCimbX+q51FFxFQ90vB+YWBL4C81xRIRETEtdVJIn1SmiBhytvdvnJf0LeC3NYUTERExLXUyascRkhYBZti+uo8xRUTvLQqsUncQERER00knjwjfCrgUOLXMryspZ6gjhpCkP0u6vExXAlcD/1lzWBEREdNKJ1079gY2AM4CsH2ppBf3IaaImLotG94/BdxhOw9kiYiI6KG2z0gDT9p+oGnZ070MJiJ6w/ZNwFLAVlRDV65Va0ARERHTUCeF9JWS3g3ML2l1Sd8Dzu9TXBExBZJ2A44CXlCmoyR9st6oIiIippdOCulPAmtTDX33c+BBYPdudippDUmXNkwPStq9qc1Gkh5oaNNyXNyIaGkXYEPbX7H9FeDVwIe62VDyNWK0JGcjBqeTUTseBb5Upikpo36sCyBpfuBW4MQWTc+xvWWL5RExMQFzG+bnlmUdS75GjJbkbMTgtF1IS5oNfBGY1fg52+tMMYY3AdeVPp0R0RuHAX+UNHbwfDtwSA+2m3yNGC3J2Yg+6mTUjqOAzwJ/prc3Ge5A1VWklddIugy4DdjD9pXNDSTtCuwKMGPGjB6GFTGaJM0H/IFqhJ3Xl8U72/5TDzY/pXwt8SVnIwYnx9iIPuqkkL7Ldk/HjZb0POBtwJ4tVl8CzLT9sKQtgP8CVm9uZPtg4GCA2bNnu5fxRYwi209L+r7t9ajyqCd6ka8lvuRsxADkGBvRf53cbLiXpJ9I2lHSNmPTFPe/OXCJ7TuaV9h+0PbD5f0pwIKSlpvi/iLmFWdIeqekrvpFjyP5GjFakrMRfdbJGemdgTWBBXmma4eBE6aw/x0Z55KTpBWpHiJhSRtQFf33TGFfEfOSDwOfBp6S9DjVjYa2/fwpbDP5GjFakrMRfdZJIf0q22v0aseSFgPeTHXAH1v2EQDbBwHbAh+V9BTwGLCD7VxWimiD7SV6ub3ka8RoSc5GDEYnhfT5ktayfVUvdmz7EWDZpmUHNbw/EDiwF/uKmFeUoa4WGbtkK+nVwPPK6j/Zfqib7SZfI0ZLcjZiMDoppF8NXCrpBqqHsoxdKp7q8HcR0TvfBO4E/r3M/xy4AliY6uaiz9cUV0RExLTTSSH9lr5FERG98ibgVQ3z99veqtx0eE5NMUVERExLbY/aUQZzvwV4kuomw7EpIobHfLafapj/PFSXjoDF6wkpItolaWlJudIbMSI6ebLhJ4G9gDt49qgdSfiI4fE8SUuM9YW2fRqApCWpundExJCRdBbVeM8LABcDd0o6z/anaw0sIibVyTjSuwFr2F7b9svLlCI6Yrj8GDhW0j8eQSZpJlVf6Z/UFlVETGRJ2w8C2wBH2t4Q2KTmmCKiDZ30kb4ZeKBfgUTE1Nn+tqRHgXPL8FcCHgK+YfuH9UYXEeNYQNILgXcBX6o7mIhoXyeF9PXAWZJOphq1A6gO3D2PKiK6Voa4OkjSEmW+qyHvImJg9gF+C5xn+yJJLwauqTmmiGhDJ4X0X8v0PJ4ZlzYihpCkFYD9gJWAzSWtBbzG9iH1RhYRzWwfDxzfMH898M76IoqIdrVdSNv+KoCk
xcv8w/0KKiKm7HDgMJ65TPx/wLFACumIISNpFeB7wOvKonOA3WzfUl9UEdGOtm82lPQySX8CrgSulHSxpLX7F1pETMFyto+jjLBThsSbW29IETGOw4CTqK4grQT8uiyLiCHXyagdBwOftj3T9kzgM1QjBETE8HlE0rKUsd7Lo8Jzs3DEcFre9mG2nyrT4cDydQcVEZPrpI/0YrZ/PzZj+6wyKkBEDJ9PU53hWk3SeVQH5W3rDSkixnGPpH+lGqYSYEfgnhrjiYg2dTRqh6T/D/hpmf9XqpE8ImLI2L5E0huANaiGwLva9pM1hxURrX2Aqo/0d6iuIp0P7FxrRBHRlk66dnyA6qzWCcAvgeXKsogYMpI+Dixu+0rbVwCLS/pY3XFFxHPZvsn222wvb/sFtt9u+69j6yXtWWd8ETG+tgppSfMDJ9j+lO1X2l7f9u627+tzfBHRnQ/Zvn9spuTqh+oLJyKmYLu6A4iI1toqpG3PBZ6WtGSf44mI3phfksZmypfhjP8eMZo0eZOIqEMnfaQfBv4s6XTgkbGFtj/V86giYqpOBY6V9KMy/+GyLCJGj+sOICJa66SQPqFMETH8Pk9VPH+0zJ8O/KS+cCJiCnJGOmJIdfJkwyN6uWNJNwIPUT0k4inbs5vWC/gusAXwKPB+25f0MoaI6cr208APy9QTydmI/pC0jO17m5a9yPYNZfb4Fh+bbJs3knyN6LtJC2lJx9l+l6Q/0+Lyku11prD/N9q+e5x1mwOrl2lDqoJgwynsK2La63O+QnI2oh9+LWlz2w8CSFoLOA54GYDt/brcbvI1os/aOSO9W3ndsp+BtLA1cKRtA3+QtJSkF9q+fcBxRIySuvIVkrMR3dqPqph+K9XY70cCO/V5n8nXiB6YdNSOhqR6J/BkGe/yH9MU9m3gNEkXS9q1xfqVgZsb5m8py55F0q6S5kiac9ddd00hnIjR18d8heRsRF/YPpnqYSynAYcD77B96VQ3S/I1ou86udlwCeB0SfcCxwLH275jCvt+ve1bJb2gbPd/bZ/d6UZsHwwcDDB79uzc2RxR6XW+QnI2oqckfY9nd8FaErgO+ISkqY6KlXyNGIC2n2xo+6u21wY+DrwQ+B9Jv+t2x7ZvLa93AicCGzQ1uRVYtWF+lbIsIibR63wt20zORvTWHODihunfqZ4cPDbfteRrxGB0ckZ6zJ3A34B7gBd0s1NJiwHz2X6ovN8U2Kep2UlU38qPoboB4oH03Yro2JTzFZKzEf0wNhpWyanHy8PPxh6gtFC3202+RgxO24W0pI8B7wKWpxqK50O2r+pyvysAJ5YHry0AHG37VEkfAbB9EHAK1bA811INzbNzl/uKmOf0OF8hORvRT2cAm1A9+AxgEar+0q/tcnvJ14gB6eSM9KrA7j24AQLb1wOvaLH8oIb3prosHRGd61m+QnI2os8Wtj1WRGP7YUmLdrux5GvE4HTSR3pPqkeEryRpxtjUx9giokslXxeXtDOApOUlvajmsCKitUckvXJsRtL6wGM1xhMRbeqka8cngL2BO4Cny2IDU33AQ0T0mKS9gNlUY9IeBiwI/Ax4XZ1xRURLuwPHS7qN6nHgKwLb1xpRRLSlk64duwNr2L6nT7FERO+8A1gPuATA9m2Slqg3pIhoxfZFktak+uILcLXtJ+uMKSLa00khfTPwQL8CiYie+rttSzL84y7+iBgikja2faakbZpWvbSMI31CLYFFRNs6KaSvB86SdDLwxNhC29/ueVQRMVXHSfoRsJSkDwEfAH5cc0wR8WxvAM4EtmqxzkAK6Ygh10kh/dcyPa9METGkbH9L0puBB6kuF3/F9uk1hxURDWzvVV4z9FzEiGq7kLb9VQBJi9p+tH8hRUQvlMI5xXPEkJL06YnW54pvxPDrZNSO1wCHAIsDMyS9Aviw7Y/1K7iI6Iykh6guCbdk+/kDDCciJjbRDcDj5nFEDI9Ounb8J7AZ1WNFsX2ZpH/pR1AR0R3b
SwBI2he4Hfgp1XBaOwEvrDG0iGjScKX3CGA32/eX+aWB/WsMLSLa1PYDWQBs39y0aG4PY4mI3nmb7R/Yfsj2g7Z/CGxdd1AR0dI6Y0U0gO37qIavjIgh10khfbOk1wKWtKCkPYC/9CmuiJiaRyTtJGl+SfNJ2gl4pO6gIqKl+cpZaAAkLUNnV4wjoiadJOpHgO8CKwO3AqcBH+9HUBExZe+mytfvUvW1PK8si4jhsz9wgaTjy/x2wNdrjCci2tTJqB13U/WzbEnSnrb/rSdRRcSU2L6RCbpyJF8jhoftIyXNATYui7axfVWdMUVEezrqIz2J7Xq4rYjor+RrxBCxfZXtA8uUIjpiRPSykFYPtxUR/ZV8jYiImKJeFtIZ8zJidCRfIyIipihnpCPmTcnXiIiIKWq7kC7D8TQve1HD7PHN6yOiHsnXiIiI/uvkjPSvJf3j8cKS1gJ+PTZve792NyRpVUm/l3SVpCsl7daizUaSHpB0aZm+0kGsEfO65GvEPCo5GzE4nYwjvR/VwfmtwBrAkUwwHN4kngI+Y/sSSUsAF0s6vcWdyufY3rLLfUTMy5KvEfOu5GzEgHQyjvTJkhakehDLEsA7bP9fNzu1fTtwe3n/kKS/UD3oJUP+RPRA8jVi3pWcjRicSQtpSd/j2Xf4LwlcB3xCErY/NZUAJM0C1gP+2GL1ayRdBtwG7GH7yhaf3xXYFWDGjBlTCSVi5A17vpZtJGcjBiTH2Ij+aueM9Jym+Yt7tXNJiwO/BHa3/WDT6kuAmbYflrQF8F/A6s3bsH0wcDDA7NmzM6RXzOuGOl8hORsxKDnGRvTfpIW07SMAJC0GPG57bpmfH1io2x2Xy86/BI6yfUKL/T7Y8P4UST+QtFx5VHlEtJB8jQhIzkYMSiejdpwBLNIwvwjwu252KknAIcBfbH97nDYrlnZI2qDEek83+4uYByVfI+ZRydmIwelk1I6FbT88NlMuBy3a5X5fB7wH+LOkS8uyLwIzyrYPArYFPirpKeAxYAfbuawU0Z7ka8S8KzkbMSCdFNKPSHql7UsAJK1PlXwds30ukzxZzfaBwIHdbD8ikq8R86rkbMTgdFJI7w4cL+k2qgRdEdi+H0FFxJTtTvI1IiKirzoZR/oiSWtSPdwB4GrbT/YnrIiYiuRrRERE/7UzjvTGts+UtE3TqpeWcWmfczdwRNQj+RoRETE47ZyRfgNwJrBVi3UGcmCOGB7J14iIiAFpZxzpvcrrzv0PJyKmIvkaERExOO107fj0ROvHG6MyIgYv+RoRETE47XTtWGKCdRlzMmK4JF8jIiIGpJ2uHV8FkHQEsJvt+8v80sD+fY0uIjqSfI2IiBicTh4Rvs7YQRnA9n3Aej2PKCJ6IfkaERHRZ50U0vOVs1oASFqGzh7oEhGDk3yNiIjos04OrPsDF0g6vsxvB3y99yFFRA8kXyMiIvqskycbHilpDrBxWbSN7av6E1ZETEXyNSIiov86utRbDsQ5GEeMgORrREREf3XSRzoiIiIiIooU0hERERERXUghHRERERHRhRTSERERERFdSCEdEREREdGF2gppSW+RdLWkayV9ocX6hSQdW9b/UdKsGsKMiCI5GzE6kq8Rg1FLIS1pfuD7wObAWsCOktZqarYLcJ/tlwDfAb452CgjYkxyNmJ0JF8jBqeuM9IbANfavt7234FjgK2b2mwNHFHe/wJ4kyQNMMaIeEZyNmJ0JF8jBqSjB7L00MrAzQ3ztwAbjtfG9lOSHgCWBe5ubCRpV2DXMvuwpKv7EnG0Yzma/n1ifPrW+3q5uZm93FgLydnpJ/naoRHK2eTr9JSc7cCg8rWuQrpnbB8MHFx3HAGS5tieXXccMdySs8Mh+RrtSL4Oj+TscKqra8etwKoN86uUZS3bSFoAWBK4ZyDRRUSz5GzE6Ei+RgxIXYX0RcDqkl4k6XnADsBJTW1OAsbOy28LnGnbA4wxIp6RnI0YHcnXiAGppWtH6Y/1CeC3
wPzAobavlLQPMMf2ScAhwE8lXQvcS/WHIIZbLv9NU8nZaSn5Ok0lX6et5OwQUr6ARkRERER0Lk82jIiIiIjoQgrpiIiIiIgupJCOiIiIiOhCCunoCUlH1h1DRERExCCN/ANZYvAkNQ+jJOCNkpYCsP22gQcVERERMWAppKMbqwBXAT8BTFVIzwb2rzOoiOicpJ1tH1Z3HBERoyjD30XHJM0H7AZsAXzW9qWSrrf94ppDi4gOSfqr7Rl1xxERz5B0CXAC8HPb19UdT4wvZ6SjY7afBr4j6fjyegf5vxQxtCRdPt4qYIVBxhIRbVkaWAr4vaS/AT8HjrV9W61RxXPkjHRMmaS3Aq+z/cW6Y4mI5ypfdjcD7mteBZxve6XBRxUR45F0ie1Xlvf/DOwIbAP8heosdZ5yOCRSSEdETHOSDgEOs31ui3VH2353DWFFxDgaC+mGZfMDbwa2t71zPZFFsxTSEREREUNE0jG2d6g7jphcxpGOiIiIGCITFdGScjZ6iKSQjpYknd9h+40k/aZf8URERAQAX607gHhGRlqIlmy/tu4YIqI9ks7vJGclbQTsYXvLvgUVEV3LSDujI4V0tCTpYduLlwPu3sDdwMuAi4F/tW1JbwH+E3gUOLfhs4sB3yvtFwT2tv0rSd8F7rG9j6TNgC8BG5Xh9CKiS/niGzHtrMAEI+0MPpwYT7p2RDvWA3YH1gJeDLxO0sLAj4GtgPWBFRvafwk40/YGwBuB/yjF9Z7A9pLeCBwA7JwiOmLqJD1cXjeSdJakX0j6X0lHSVJZ95ay7BKqYbTGPruYpEMlXSjpT5K2Lsu/K+kr5f1mks4uD2OKiP77DbC47ZuaphuBs+oNLRrljHS040LbtwBIuhSYBTwM3GD7mrL8Z8Cupf2mwNsk7VHmFwZm2P6LpA8BZwP/L09riuiL9YC1gduA86i++M6h+uK7MXAtcGxD+7Evvh+QtBRwoaTfUX3xvUjSOVRffLfIF9+IwbC9ywTrMlzlEEkhHe14ouH9XCb/fyPgnbavbrHu5cA9QB4AEdEf+eIbETEguUwX3fpfYJak1cr8jg3rfgt8suGS8nrldSbwGaozZptL2nCA8UbMK7r94rtumWbY/ktZly++ERETSCEdXbH9ONUZrZNLn8s7G1bvS3WT4eWSrgT2LUX1IVQjBdwG7AL8pPS1joj+yhffiCGSIWanj3TtiJZsL15ez6Lhxgbbn2h4fyqwZovPPgZ8uMVmN2loczHV2a6I6DPbj0sa++L7KHAOsERZvS/V6DuXl5sJb5C0FQ1ffCXtAhwu6VXlS3RETEFG2pk+8ojwiIiIiAHqcojZF9veMkPMDpeckY6IiIioT0baGWHpIx0RERFRnwtt31KK3kupRtpZkzLSjquuAz9raL8p8IUyKs9ZPDPSzqPAh4DTgQMz0s5g5Ix0RERERH0yxOwIyxnpiIiIiOGSkXZGRArpiIiIiCGSIWZHR0btiIiIiIjoQs5IR0RERER0IYV0REREREQXUkhHRERERHQhhXRERERERBdSSEdEREREdCGFdEREREREF1JIR0RERER0IYV0REREREQXUkhHRERERHQhhXRERERERBdSSEdEREREdCGFdEREREREF1JITxOS3i/p3Ib5hyW9eJLPzJJkSQv0P8KIGM+o5q+knSSdVtf+I9o1zDkm6XWSrikxvb2f+4reSwE1TdlevO4YekXSLOAGYEHbT9UcTkTfjUr+2j4KOKruOCI6NWQ5tg9woO3v1h1IM0l7Ay+x/a91xzKsckY6poWcVY/oXPImor/azLGZwJW92r6k+ae6jWhfCukRJGlVSSdIukvSPZIObNHGkl5S3i8iaX9JN0l6QNK5khZp8Zl3SrpR0ssm2f/rJZ0v6X5JN0t6f1m+pKQjS1w3SfqypPnKur0l/axhG8+6ZCbpLEn7SjpP0kOSTpO0XGl+dnm9v1z6ek25THeepO9IugfYR9K9kl7esI8XSHpU0vKd
/H4j+qnO/G3Iu10k/RU4syz/gKS/SLpP0m8lzWz4zKaSri77/oGk/5H0wbKu+XL5ayVdVNpeJOm1DesmyvGInhmlHJN0HfBi4Nfl+LZQOZYeIul2SbdK+ppKcdzi2Le3pMMl/VDSKZIeAd4oaSVJvyy/gxskfaohxr0l/ULSzyQ9CLx/nJ/lLcAXge1LbJdJ2k7SxU3tPi3pV+X94ZIOknR6yfP/afp7smZZd2/5u/Ku8X6XoyKF9IgpyfQb4CZgFrAycMwkH/sWsD7wWmAZ4HPA003b3Rn4JrCJ7Ssm2P9M4L+B7wHLA+sCl5bV3wOWpPqj8AbgvcDObf5oAO8u7V8APA/Yoyz/l/K6lO3FbV9Q5jcErgdWAPal+j00Xn7aETjD9l0dxBDRN3Xnb4M3AP8EbCZpa6qD5TZUOX0O8POy3eWAXwB7AssCV5c4Wv1sywAnAweUtt8GTpa0bEOz8XI8oidGLcdsrwb8FdiqHN+eAA4HngJeAqwHbAp8sGHbjce+r5dl7y7vlwDOB34NXFZ+/jcBu0varGEbW1Pl9lKM0z3L9qnAfsCxJbZXACcBL5L0Tw1N3wMc2TC/E9UxeTmq+uAoAEmLAacDR1P9DdgB+IGktVr+BkeF7UwjNAGvAe4CFmha/n7g3IZ5UyXhfMBjwCtabGtWabcHcBWwShv73xM4scXy+YG/A2s1LPswcFZ5vzfwsxb7XqDMnwV8uWH9x4BTW7Vt+Hn/2hTDhlR/kFTm5wDvqvvfLFOmsWkI8nfsMy9uWPbfwC4N8/MBj1Jdbn4vcEHDOgE3Ax9sjpvqYHph0/4uAN5f3o+b45ky9WoatRwr8zdSFehQFcdPAIs0tN8R+H3Dz9F87DscOLJhfsMWbfYEDivv9wbObvP3uTcNx+6y7IfA18v7tYH7gIUaYjmmoe3iwFxgVWB74Jymbf0I2Kvu/zdTmXJGevSsCtzk9m+6Ww5YGLhugjafBb5v+5Y2999qW8sBC1KdBRhzE9W34Xb9reH9o1QJOJGbG2ds/7F8biNJa1L9kTypg/1H9Fvd+TumMXdmAt9V1VXrfuBeqoJ5ZWClxraujnzj7Wclnp3/8Ny/AZ3meESnRi3Hms2kOpbe3tD+R1RncFtte7z9rTT2+bKNL1IV6RNto11HAO+WJKov0Me5OpP+nG3bfpjq512pxLVhU1w7AStOIZbapYP56LkZmCFpgTb/UNwNPA6sRnWZp5VNgVMl/c32L9vY/wbj7OdJqkS5qiybAdxa3j8CLNrQvpPEcQfLj6Dq3vE34Be2H+9gPxH9Vnf+jmnMnZupzi495/KupNWBVRrm1Tjf5Daq/G80Azi1zZgiemGkcqyFm6nOSC83Qfytjn3N+7vB9uptxjeR57Sz/QdJfwf+mapLybubmqw69kbS4lTdZW4rcf2P7Te3ue+RkDPSo+dC4HbgG5IWk7SwpNeN19j208ChwLfLzQfzq7pZb6GGZlcCbwG+L+ltk+z/KGATSe+StICkZSWta3sucBzwdUlLlL7UnwbGbjC8FPgXSTMkLUl1maldd1H1V5twzM/iZ8A7qIrpIydpGzFodedvKwcBe0paG/5x0/B2Zd3JwMslvV3VjcEfZ/wvwacAL5X07vK3YXtgLar+qhGDMmo51hzP7cBpwP6Sni9pPkmrSXpDB/u7EHhI0udV3Ug5v6SXSXpVF7HfAcxSGTigwZHAgcCTts9tWreFqkEJnkfVV/oPtm+m+lvwUknvkbRgmV7V1N965KSQHjGlYN2KqtvCX6kus24/ycf2AP4MXER1ieWbNP3b274M2BL4saTNJ9j/X4EtgM+UbV0KvKKs/iTVmefrgXOpbig4tHzudOBY4HLgYjo4uNp+lOomivPK5aBXT9D2ZuASqm/R57S7j4hBqDt/x4npxLLNY1TdwX8FsHlZdzewHfDvwD1UhfEcqjNmzdu5p8TwmdL2c8CWZRsRAzFqOTaO91LdjHsVVf/jXwAv7GB/c0us
61I9g+Fu4CdUgwF06vjyeo+kSxqW/xR4Gc+cLGt0NLAX1e9yfcogALYfojq7vwPVGeq/Uf1eFmqxjZExdlNWxLQh6VDgNttfrjuWiOmknJW6BdjJ9u/rjici6qFqeMA7gVfavqZh+eHALfPS8Td9pGNaUfUUxG2ohgyKiCkqQ2b9kWpkg89S3ST1h1qDioi6fRS4qLGInlela0c8h6SdVA2+3jx19eSlQZG0L9Uls/+wfUPd8UTUoQ/5+xqqEQ3uprpk/nbbj/Us4IgRM6rHyPFI+u9xfp4vjtP+RmA3qm5c87x07YiIiIiI6ELOSEdEREREdGFa9ZFebrnlPGvWrLrDiBi4iy+++G7by9cdR6eSszGvGsWcTb7GvGqifJ1WhfSsWbOYM2dO3WFEDJyk5ifKjYTkbMyrRjFnk68xr5ooX/vatUPSoZLulHRFw7JlJJ0u6ZryuvQ4n31faXONpPf1M86IqCRnI0ZH8jWifv3uI3041dOAGn0BOKM8uvKMMv8skpahGsx7Q6rHUe813h+DiOipw0nORoyKw0m+RtSqr4W07bOpnmzTaGvgiPL+CODtLT66GXC67Xtt3wecznP/WEREjyVnI0ZH8jWifnX0kV6hPEseqsdDrtCizcrAzQ3zt5RlzyFpV2BXgBkzZvQsyPU/e2TPthXRysX/8d66Q2hXcjaCkcnZ5GsEg8vXWoe/czWI9ZQGsrZ9sO3Ztmcvv/xI3QAdMXKSsxGjI/ka0X91FNJ3SHohQHm9s0WbW4FVG+ZXKcsiYvCSsxGjI/kaMUB1FNInAWN3CL8P+FWLNr8FNpW0dLkBYtOyLCIGLzkbMTqSrxED1O/h734OXACsIekWSbsA3wDeLOkaYJMyj6TZkn4CYPteYF/gojLtU5ZFRB8lZyNGR/I1on59vdnQ9o7jrHpTi7ZzgA82zB8KHNqn0CKiheRsxOhIvkbUr9abDSMiIiIiRlUK6YiIiIiILqSQjoiIiIjoQgrpiIiIGkl6qaQzJF1R5teR9OW644qIyaWQjoiIqNePgT2BJwFsXw7sUGtEEdGWFNIRERH1WtT2hU3LnqolkojoSArpiIiIet0taTXK47wlbQvcXm9IEdGOvo4jHREREZP6OHAwsKakW4EbgJ3qDSki2pFCOiIioka2rwc2kbQYMJ/th+qOKSLak64dERERNZK0rKQDgHOAsyR9V9KydccVEZNLIR0REVGvY4C7gHcC25b3x9YaUUS0JV07IiIi6vVC2/s2zH9N0va1RRMRbcsZ6YiIiHqdJmkHSfOV6V3Ab+sOKiIml0I6IiKiXh8CjgaeKNMxwIclPSTpwVoji4gJpWtHREREjWwvUXcMEdGdnJGOiIiokaRfStpCUo7JESOm7aSV9PJ+BhIRETGP+iHVA1iukfQNSWvUHVBEtKeTb78/kHShpI9JWrJvEUVERMxDbP/O9k7AK4Ebgd9JOl/SzpIWrDe6iJhI24W07X+m+sa8KnCxpKMlvbmbnUpaQ9KlDdODknZvarORpAca2nylm31FxNQkXyP6rzyA5f3AB4E/Ad+lKqxP72JbydmIAenoZkPb10j6MjAHOABYT5KAL9o+oYPtXA2sCyBpfuBW4MQWTc+xvWUnMUZERdLKwEwa8tz22Z1uJ/ka0V+STgTWAH4KbGX79rLqWElzOt1ecjZicNoupCWtA+wMvJXqG/JWti+RtBJwAdB2Id3kTcB1tm/q8vMR0UTSN4HtgauAuWWxgY4L6SbJ14je+7HtUxoXSFrI9hO2Z09x28nZiD7q5Iz094CfUJ19fmxsoe3bylnqbu0A/Hycda+RdBlwG7CH7SunsJ+IecnbgTVsP9Hj7SZfI3rva8ApTcsuoOraMVXJ2Yg+aquQHrs0ZPunrdaPt7yN7T4PeBuwZ4vVlwAzbT8saQvgv4DVW2xjV2BXgBkzZnQTRsR0dD2wINXDHXqiF/latpOcjQAkrQisDCwiaT1AZdXzgUV7sP0cYyP6
rK2bDW3PBVYtSdlLmwOX2L6jxT4ftP1weX8KsKCk5Vq0O9j2bNuzl19++R6HFzGyHgUulfQjSQeMTVPc5pTztaxPzkZUNgO+BawC7N8w/T/giz3Yfo6xEX3WSdeOG4DzJJ0EPDK20Pa3p7D/HRnnklP5pn6HbUvagKrov2cK+4qYl5xUpl5Kvkb0kO0jgCMkvdP2L8drJ+l9pW2nkrMRfdZJIX1dmeYDxh5n6m53LGkx4M3AhxuWfQTA9kHAtsBHJT0FPAbsYLvr/UXMS2wfUa4gvbQsutr2k91uL/ka0T8TFdHFbkBHhXRyNmIwOimkr7J9fOMCSdt1u2PbjwDLNi07qOH9gcCB3W4/Yl4maSOqA++NVP0uVy1ntboatSP5GlErTd7k2ZKzEYPRyZMNW92s0GpZRNRvf2BT22+w/S9UfTG/U3NMEdGdnCmOGFKTnpGWtDmwBbBy081Kzwee6ldgETElC5aHMgBg+//yqOGIkdXxGemIGIx2unbcRvUkw7cBFzcsf4jqzuKIGD5zJP0E+FmZ34kqjyNiiEiaD9jW9nETNDtvUPFERGcmLaRtXwZcJunoqdysFBED9VHg48Cnyvw5wA/qCyciWrH9tKTPAeMW0rY/McCQIqIDndxsuIGkvYGZ5XMCbPvF/QgsIrpXnmj47TJFxHD7naQ9gGN59vCy99YXUkS0o5NC+hCqrhwXA3P7E05ETIWk42y/S9KfaXGDku11aggrIia2fXn9eMMyAzlRFTHkOimkH7D9332LJCJ6YbfyumWtUURE22y/qO4YIqI7nRTSv5f0H8AJwBNjC21f0vOoIqIrtm8vrzfVHUtEtEfSosCngRm2d5W0OrCG7d/UHFpETKKTQnrD8jq7YZmBjXsXTkT0gqSHeG7XjgeoRu74jO3rBx9VRIzjMKpuk68t87cCxwMppCOGXNuFtO039jOQiOip/wRuAY6mujF4B2A14BLgUGCjugKLiOdYzfb2knYEsP2opIwdHTEC2i6kJX2l1XLb+/QunIjokbfZfkXD/MGSLrX9eUlfrC2qiGjl75IWoVxFkrQaDV0oI2J4dfKI8EcaprnA5sCsPsQUEVP3qKR3SZqvTO8CHi/r8rjhiOGyF3AqsKqko4AzgM/VG1JEtKOTrh37N85L+hbw255HFBG9sBPwXaqHsBj4A/Cv5axXHu4QMSTKkw2XBrYBXk3VFWs323fXGlhEtKWTmw2bLQqs0qtAIqJ3ys2EW42z+txBxhIR4xt7smF5RPjJdccTEZ1pu2uHpD9LurxMVwJXU93QFBFDRtJLJZ0h6Yoyv46kL9cdV0S09DtJe0haVdIyY1PdQUXE5Do5I934gIengDtsP9XjeCKiN34MfBb4EYDtyyUdDXyt1qgiopU82TBiRHXSR/omSa8A/rksOhu4vC9RRcRULWr7wqYRtPLFN2LIlD7SX7B9bN2xRETnOunasRtwFPCCMh0l6ZP9CiwipuTuMoTW2HBa2wK31xtSRDSz/TTV1aOIGEGddO3YBdjQ9iMAkr4JXAB8rx+BRcSUfBw4GFhT0q3ADVQjeUTE8PmdpD2AY6mGmAXA9r31hRQR7eikkBbV+NFj5pZlXZF0I/BQ2c5Ttmc3rRfV8F1bAI8C77d9Sbf7i5hXSJof+JjtTSQtBsxn+6EebPdGkrMR/dDzPtLJ14jB6KSQPgz4o6QTy/zbgUOmuP83TjBW5ubA6mXaEPhheY2ICdieK+n15f0jk7XvUHI2osdsv6hPm06+RvRZJzcbflvSWcDry6Kdbf+pL1FVtgaOtG3gD5KWkvRC2+nnGTG5P0k6CTieZ18qPqGP+0zORnRB0qLAp4EZtneVtDqwhu3f9HG3ydeIHujkZsNXA9fYPsD2AcB1kqby7dXAaZIulrRri/UrAzc3zN9SljXHtaukOZLm3HXXXVMIJ2JaWRi4B9iY6sEsW/HsISy7kZyN6I/DgL8Dry3ztzL1oSqTrxED0EnXjh8Cr2yY
f7jFsk683vatkl4AnC7pf22f3elGbB9MdVMVs2fPdpexREwrtneeaL2kPW3/W4ebTc5G9MdqtreXtCOA7UfVNHZlF5KvEQPQ9hlpQOUSEPCPIXu6fsS47VvL653AicAGTU1uBVZtmF+lLIuIqduu0w8kZyP65u+SFuGZ4SpXA56YygaTrxGD0Ukhfb2kT0lasEy7Add3s1NJi0laYuw9sClwRVOzk4D3qvJq4IH03YromY7OdiVnI/pqL+BUYFVJRwFnAJ/rdmPJ14jB6eSM8keAA4AvU31rPgNo1e+qHSsAJ5YrVwsAR9s+VdJHAGwfBJxCNSzPtVRD80x4qToiOtLpJdrkbESPSXqd7fOonhS8DfBqqi+5u00w2kY7kq8RA9LJqB13AjuMt76TPpe2rwde0WL5QQ3vzbPH1IyI3unojHRyNqIvDgDWBy6w/Urg5F5sNPkaMThd93FuYTug05uXIqIPJC3T/FQ0SS+yfUOZPb6GsCLi2Z6UdDCwiqQDmlfa/lQNMUVEB3pZSE/1DuOI6J1fS9rc9oMAktYCjgNeBmB7vzqDiwigGpJyE2Az4OKaY4mILvSykM6wOBHDYz+qYvqtwBrAkcBO9YYUEY1KP+hjJP3F9mV1xxMRncsZ6YhpyPbJkhYETgOWAN5h+/9qDisiWntM0hnACrZfJmkd4G22p/pQlojos7YL6fS5jBh+kr7Hs68OLQlcB3xCUvpcRgynHwOfBX4EYPtySUcz9acbRkSfdXJGOn0uI4bfnKb59LuMGH6L2r6w6WGGT9UVTES0r5NCOn0uI4ac7SPgHw9heNz23DI/P7BQnbFFxLjuLk8zHHuy4bZAHo4SMQI6GUc6fS4jRscZVKMBPFzmF6HK3dfWFlFEjOfjwMHAmpJuBW4gJ6oiRsKkhXT6XEaMpIVtjxXR2H5Y0qJ1BhQRz1WuFn3M9iblStJ8th+qO66IaE87Z6TT5zJi9Dwi6ZW2LwGQtD7wWM0xRUQT23Mlvb68f6TueCKiM5MW0ulzGTGSdgeOl3Qb1dCUKwLb1xpRRIznT5JOohr96h/FtO0T6gspItrRyc2G6XMZMSJsXyRpTaobgwGutv1knTFFxLgWBu4BNm5YZiCFdMSQ66SQTp/LiCEnaWPbZ0rapmnVS8s9DTkwRwwZ2ztPtF7Snrb/bVDxRET7Oimk0+cyYvi9ATgT2KrFupzhihhN2wEppCOGUCeF9O6kz2XEULO9V3md8AxXRIwUTd4kIurQyTjS6XMZMeQkfXqi9ba/PahYIqJnPHmTiKhDO+NIp89lxOhYYoJ1ORhHjKackY4YUu2ckU6fy4gRYfurAJKOAHazfX+ZXxrYv8bQImIckpaxfW/TshfZvqHMHl9DWBHRhvkma9DY57LF9IFudippVUm/l3SVpCsl7daizUaSHpB0aZm+0s2+IuZR64wV0QC27wPW62ZDydeIvvu1pOePzUhaC/j12Lzt/TrZWHI2YnDa6drRjz6XTwGfsX2JpCWAiyWdbvuqpnbn2N6yi+1HzOvmk7R0KaCRtAyd3VzcKPka0V/7URXTb6W6D+lIYKcpbC85GzEg7RxYe97n0vbtwO3l/UOS/gKsDDQneUR0Z3/gAkljl4S3A77ezYaSrxH9ZftkSQtSPeRsCeAdtv9vCttLzkYMSDuPCO9rn0tJs6guOf+xxerXSLoMuA3Yw/aVLT6/K7ArwIwZM6YaTsS0YPtISXN45klp27Q4G9WxqeZr2UZyNgKQ9D2efUJqSeA64BPlZv5P9WAfs8gxNqJvOrnU+5w+l5K66nM5RtLiwC+B3W0/2LT6EmBmeYLiFsB/Aas3b8P2wcDBALNnz86oBBFFKZx7dgaqF/la4krORlTmNM1f3MuN5xgb0X+dFNK97HNJuYz1S+CoVkPoNSa97VMk/UDScrbv7nafEdGd5GtE79k+AkDSYsDjtueW+fmBhaay7eRsxGBMOmpHg7E+l/tK2hc4H/j3bnYqScAh
wF/Gu1lR0oqlHZI2KLHe083+IqJ7ydeIvjsDWKRhfhHgd91uLDkbMTidPNmwl30uXwe8B/izpEvLsi8CM8q+DgK2BT4q6SngMWAH27msFDF4ydeI/lrY9sNjM6W7xaJT2F5yNmJAOuqa0as+l7bPZZInNdk+EDhwqvuKiKlJvkb03SOSXmn7EgBJ61MVt11JzkYMTtd9nCMiIqIndgeOl3QbVQG8IrB9rRFFRFtSSEdERNTI9kWS1qR6GAvA1bafrDOmiGhPCumIiIgaSNrY9pmStmla9dIyjvRzRtuIiOGSQjoiIqIebwDOBLZqsc5ACumIIZdCOiIioga29yqvO9cdS0R0J4V0REREDSR9eqL1440BHRHDI4V0REREPZaYYF3GdI4YASmkIyIiamD7qwCSjgB2s31/mV+a6mnCETHkOnlEeERERPTeOmNFNIDt+4D16gsnItqVQjoiIqJe85Wz0ABIWoZcMY4YCUnUiIiIeu0PXCDp+DK/HfD1GuOJiDalkI6IiKiR7SMlzQE2Lou2sX1VnTFFRHtSSEdERNSsFM4pniNGTPpIR0RERER0IYV0REREREQXUkhHRERERHQhhXRERERERBdSSEdEREREdKG2QlrSWyRdLelaSV9osX4hSceW9X+UNKuGMCOiSM5GjI7ka8Rg1FJIS5of+D6wObAWsKOktZqa7QLcZ/slwHeAbw42yogYk5yNGB3J14jBqeuM9AbAtbavt/134Bhg66Y2WwNHlPe/AN4kSQOMMSKekZyNGB3J14gBqeuBLCsDNzfM3wJsOF4b209JegBYFri7sZGkXYFdy+zDkq7uS8TRjuVo+veJ8elb7+vl5mb2cmMtJGenn+Rrh0YoZ5Ov01NytgODyteRf7Kh7YOBg+uOI0DSHNuz644jhltydjgkX6MdydfhkZwdTnV17bgVWLVhfpWyrGUbSQsASwL3DCS6iGiWnI0YHcnXiAGpq5C+CFhd0oskPQ/YATipqc1JwNh5+W2BM217gDFGxDOSsxGjI/kaMSC1dO0o/bE+AfwWmB841PaVkvYB5tg+CTgE+Kmka4F7qf4QxHDL5b9pKjk7LSVfp6nk67SVnB1CyhfQiIiIiIjO5cmGERERERFdSCEdEREREdGFFNIREREREV1IIR0RERExRCStKelNkhZvWv6WumKK1lJIR89J2rnuGCIiIkaRpE8BvwI+CVwhqfHx7vvVE1WMJ4V09MNX6w4gIp7ReBZL0pKSDpF0uaSjJa1QZ2wR8RwfAta3/XZgI+D/k7RbWae6gorWRv4R4VEPSZePtwrIgTliuOwHnFre7w/cDmwFbAP8CHh7PWFFRAvz2X4YwPaNkjYCfiFpJimkh04K6ejWCsBmwH1NywWcP/hwIqJNs22vW95/R9L7JmocEQN3h6R1bV8KYPthSVsChwIvrzWyeI4U0tGt3wCLjyV6I0lnDTyaiJjICyR9muqL7vMlqeFx0OniFzFc3gs81bjA9lPAeyX9qJ6QYjx5smFExDQnaa+mRT+wfZekFYF/t/3eOuKKiBh1KaQjIuYBktYEVgb+ONb/six/i+1Tx/9kRESMJ5f0IiKmOUmfJMNpRUT0XArpaElSRzcMStpI0m/6FU9ETMmuZDitiKGRY+z0kZsNoyXbr607hojomQynFTFEcoydPnJGOlqS9HB53UjSWZJ+Iel/JR0lSWXdW8qyS6jGox377GKSDpV0oaQ/jV1GlvRdSV8p7zeTdLak/B+M6L87JK07NlOK6i2B5chwWhEDl2Ps9JEz0tGO9YC1gduA84DXSZoD/BjYGLgWOLah/ZeAM21/QNJSwIWSfgfsCVwk6RzgAGAL208P7seImGdlOK2I4ZVj7AjLN5Vox4W2bykJeSkwC1gTuMH2NWU82p81tN8U+IKkS4GzgIWBGbYfpXr06enAgbavG9hPEDEPK/n7t3HWnTfoeCLiWXKMHWE5Ix3teKLh/Vwm/38j4J22r26x7uXA
PcBKPYotIiJilOUYO8JyRjq69b/ALEmrlfkdG9b9FvhkQz+v9crrTOAzVJexNpe04QDjjYiIGBU5xo6IFNLRFduPUw2pdXK5EeLOhtX7AgsCl0u6Eti3JPwhwB62bwN2AX4iaeEBhx4x7WQorYjpJcfY0ZEnG0ZEzGPK8Hd72N6y5lAiIkZazkhHRIy4DKUVEVGP3GwYETG9ZCitiIgBydmFiIjpJUNpRUQMSM5IR0RMLxlKKyJiQHJGOiJi+stQWhERfZBCOiJimstQWhER/ZHh7yIiIiIiupAz0hERERERXUghHRERERHRhRTSERERERFdSCEdEREREdGFFNIREREREV1IIR0RERER0YUU0hERERERXfj/AfZ/mxbNw8MzAAAAAElFTkSuQmCC\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Analysis of Changes in the User Click Environment"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:39:41.961797Z",
+ "start_time": "2020-11-13T15:39:41.949829Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def plot_envs(df, cols, r, c):\n",
+    "    # Create a single figure that holds all subplots\n",
+    "    plt.figure(figsize=(10, 5))\n",
+ " i = 1\n",
+ " for col in cols:\n",
+ " plt.subplot(r, c, i)\n",
+ " i += 1\n",
+ " v = df[col].value_counts().reset_index()\n",
+ " fig = sns.barplot(x=v['index'], y=v[col])\n",
+ " for item in fig.get_xticklabels():\n",
+ " item.set_rotation(90)\n",
+ " plt.title(col)\n",
+ " plt.tight_layout()\n",
+ " plt.show()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAsgAAAFgCAYAAACmDI9oAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAA7s0lEQVR4nO3de7yu9ZzH/9e7REeRGqbDbhMyIWKPnGY0ySHKIZKEEaOZcaofOYSfchxmZBBjFDoQneQnMkhsJKN2Sakmhw46oSKddP78/riulXuv1uG+11r3uu611+v5eNyPdV/nz1p7ffb3s7739/peqSokSZIkNVbrOgBJkiRplFggS5IkST0skCVJkqQeFsiSJElSDwtkSZIkqYcFsiRJktTDAlmSgCSvSHJKz/INSR40zTFLk1SSe8zy2hcn2WE252jPM23MkqTpWSAPSZeN7VyxsdViVlXrVtWFXccxiLmKOcnTknwvyfVJrklyVpK3JllzLuKUFrJVoX3X9CyQ58libmy7lGS7JJd1HYe0UCTZFTgO+CKweVXdD9gN2BTYbJJjbPS1aK0KbaXuzgJZM5Jk9a5jkGYqyWZJjk9yVdtD+okJ9qkkD27fr5XkwCSXJPlTklOSrDXBMS9oh0s8Yprrv6w91zVJ3jFu22pJ3pbk1+32Y5Js0G77nySvG7f/z5LsMkjMSR6f5NQk17bHb9euD/AR4D1VdUhV/QGgqi6oqtdX1S/b/Q5IclySLyS5DnhFko2TnJDkD0l+leTVPTEeluR9Pcsr/eHa/sz2S3Jekj8mOdTeakldskCeAyPQ2E7Y2LXblid5b5IftR+XfjvJhu22QRrbw5J8Ksk3ktwI/EOSv2nPf22Sc5M8p+c8hyX5ZJIT2+v+JMkW434er0nyy3b7e5Ns0X4f17VFwT179t8pzce817b7bN2z7eIk+yY5u/15Hp1kzSTrAP8DbJzmI7Abkmw81c9Sq740f9x9HbgEWApsAhw1zWEfBh4LPBHYAHgLcOe48+4JfAjYoap+PsX1twI+BbwM2Bi4H03v7JjXA88DntJu/yPwyXbbl4Ddx51rc+DEfmNOskm7//va9fsCX06yEbBlG8uXp/hZjHkuTU/zfYAjaX6Gl7UxvxD4QJLt+zjPmD2AZwBbAA8F3jnAsdJQjED7/py2fb22bW//pmfbW5Nc3rahFyR56lx8z2pVla9ZvIDVgZ8B/wmsA6wJPBl4BXBKz34FPLh9/0lgOU3DvDpNA3Yvmsa6gHsAewK/GjtmiutvAlwDPIvmD56ntcsbtduXA7+maXDWapc/2G57OfCjnnNtBVwL3GuCmA8D/gQ8qb3Oem18bwfuCWwPXA9s2bP/NcDj2u/nSOCocT+PrwL3Bh4O3AKcDDwIWB84D/jHdt9tgN8D27Y/r38ELu6J82LgNJqGeQPgfOBf2m3bAZd1/Xvia3RewBOAq4B7jFs/Yc62v+9/Bh41wbnGcnbf9nd20z6u/65xubAOcCtNYU37+/vUnu1/DdzW5tF6wI00Qx8A3g98bsCY3wp8fty6b7V59eT2HGv2bDuq/X/hJuBl7boDgB/07LMZcAewXs+6fwMOa98fBryvZ9tKednm8L/0LD8L+HXXvyu+FveL7tv3h7b5/jRgDZo/cn9F0+ZuCVwKbNzuuxTYouuf2ar0sgd59h5HU5i9uapurKqbq+qUyXZOshrwSmDvqrq8qu6oqlOr6pae3fYB3gxsV1W/mub6LwW+UVXfqKo7q+okYAVNAzPm0Kr6RVX9GTgGeHS7/ivAo5Ns3i7vARw/LpZeX62qH1XVne051qUptm+tqu/S9Mrt3rP/V6rqtKq6naZAfvS48/17VV1XVecCPwe+XVUXVtWfaHp+t2n32wv4dFX9pP15HU5TUD++51wfr6orqvlI+GsTXEsasxlwSft72Y8NaRrGX0+xz5uBT1ZVP+PdN6Zp2ACoqhtp/pgcsznwlbbH6Fqa
gvkO4P5VdT1N7++L2313p8mtQWLeHNh17PztNZ5MU4iPxfHXPfG9uKruA5xJ0+CPubTn/cbAH9r4xlxCUyT0q/d8l7TnlLrUdfu+G3BiVZ1UVbfRfCq0Fk3RfQdN4b1VkjWq6uKqmur/KA3IAnn2um5sp2rsxvy25/1NNIUtAzS2Y8Y3iJe2xfKY8Q3ihNft8bue93+eYHls/82BN437Hjdj5QZ0umtJYy4FlqT/G8uuBm6m+eh/Mk8H3pnkBX2c70p6bnZLsjbNMIve+Hasqvv0vNasqsvb7V8Cdk/yBJr/S743YMyX0vQg955/nar6IHABcDmwSx/fR/W8vwLYIMl6PeuWtOeCphds7Z5tD5jgfL03AC5pzyl1qev2fWOadhWAtr29FNikLa73ofk05/dJjnII4dyyQJ69rhvbqRq7fvTT2I4Z3yBu1v7FPKa3QZxLlwLvH/c9rl1VX+rj2Jp+Fy0yp9EUqR9Msk47Xv1Jk+3cNkqfAz6S5ka01ZM8Icm9enY7F3gm8MnesfiTOA7YKcmT23H272Hl/4v/G3j/2Cc7STZK8tye7d+g+aPxPcDR4/5I7SfmLwA7J3lGu37NNDfNbdoe9yZg/ySvTnLfNB4C3H+Kn9GlwKnAv7Xn2xp4VXstgLOAZyXZIMkDaBr28V6bZNM0NyS+Azh6qh+iNA+6bt+voMl14K6baDejbWer6otV9eR2n6K5B0JzxAJ59rpubCdt7PqMf9rGdhI/oempfUuSNdLcGLgz09/sNBOHAP+SZNu2sV4nybPH9VZN5nfA/ZKsP4S4tABV1R00v6sPBn5Dc2PZbtMcti9wDnA68Aeahmil/z+r6mfATsAhSXac4vrnAq+lmUbtSpqb8Hp7kz4GnAB8O8n1wP/SjL8fO/4W4Hhgh/YcA8XcFrPPpbl/4CqaIuDNY99PVR0NvIhm+NalNI3+McDBwLFTXG93mnGQV9AM39q/qr7Tbvs8zVjOi4FvM3Hx+8V224U0PXDvm2AfaT513b4fAzw7yVOTrEHzx+stwKlJtkyyfXvum2k+de23/VY/uhr8vCq9aHpO/z+a8XtXAx9n6kH8awEfpfkr8E/AD9p1S9v97tHut4ymwNtxmutvC3yfphG8imbYxJJ223Lgn3r2XSmudt1n2+v+7bj142/Se9+47Q9vr/snmhuUnt+zbaX9uftNOXedu10+BXhFz/L7gM/0LD+TpqG/luY/rGNpbwiiaXR36Nn3AOALPcufa/9trqW9ocGXL1+j8xqfw758jcprBNr357ft65/a9vbh7fqtaQr469u2/+u2b3P7SvuDliSpE0kupvlD/jvT7StJ88EhFpI0x5Lskb/Mvd37Orfr2CRJ07MHeQFIsgfw6Qk2XVJVD5/veCRJ0uzZvo8uC2RJkiSpR79Tl8yLDTfcsJYuXdp1GNJIOeOMM66uqo26jmMi5qy0MvNVWlgmy9mRKpCXLl3KihUrug5DGilJLpl+r26Ys9LKzFdpYZksZ71JT5IkSeox9AI5yX2SHJfk/5Kc3z6xTdIIMl+lhcWclYZjPoZYfAz4ZlW9sH2s6trzcE1JM2O+SguLOSsNwVAL5Pbxvn9P89QZqupW4NZhXlPSzJiv0sJizkrDM+we5AfSPPr40CSPAs4A9q6qG8d2SLIXsBfAkiVL5vTij33zEXN6PqkfZ/zHy7sOYaamzVcwZ7VqWcD5CraxWoTmK2eHPQb5HsBjgE9V1TbAjcDbeneoqoOrallVLdtoo5GcGUdaLKbNVzBnpRFiGysNybAL5MuAy6rqJ+3ycTTJLGn0mK/SwmLOSkMy1AK5qn4LXJpky3bVU4HzhnlNSTNjvkoLizkrDc98zGLxeuDI9u7aC4E95+GakmbGfJUWFnNWGoKhF8hVdRawbNjXkTR75qu0sJiz0nD4JD1JkiSphwWyJEmS1MMCWZIkSerRd4Gc5PP9rJM0OpLcO8l6XcchSdJCMkgP8sN7F5KsDjx2bsOR
NBeS/G2Sc4CzgZ8n+VkS81WSpD5MWyAn2S/J9cDWSa5rX9cDvwe+OvQIJc3EZ4HXVNXSqtoceC1waMcxSZK0IExbIFfVv1XVesB/VNW929d6VXW/qtpvHmKUNLg7quqHYwtVdQpwe4fxSJK0YPQ9D3JV7ZdkE2Dz3uOq6gfDCEzSrHw/yaeBLwEF7AYsT/IYgKo6s8vgJEkaZX0XyEk+CLyY5jGWd7SrC7BAlkbPo9qv+49bvw1N3m4/v+FIkrRwDPIkvecDW1bVLcMKRtLcqKp/6DoGSZIWqkEK5AuBNQALZGnEJXnXROur6j3zHYskSQvNIAXyTcBZSU6mp0iuqjfMeVSSZuvGnvdrAjsB53cUiyRJC8ogBfIJ7UvSiKuqA3uXk3wY+FZH4UiStKAMMovF4UnWApZU1QVDjEnS3Fsb2LTrICRJWggGedT0zsBZwDfb5UcnsUdZGkFJzklydvs6F7gA+GjHYUmStCAMMsTiAOBxwHKAqjoryYOGEJOk2dup5/3twO+qygeFSJLUh757kIHbqupP49bdOZfBSJobVXUJcB9gZ5opGrfqNCBJkhaQQQrkc5O8BFg9yUOSHAScOqS4JM1Ckr2BI4G/al9HJnl9t1FJkrQwDFIgvx54OM0Ub18CrgP2GUJMkmbvVcC2VfWuqnoX8Hjg1R3HJEnSgjDILBY3Ae9oX5JGW/jLI+Fp36ejWCRJWlD6LpCTLAPeDiztPa6qtp77sCTN0qHAT5J8pV1+HvDZ7sKRJGnhGGQWiyOBNwPn4M150shKshrwvzQzzjy5Xb1nVf20s6AkSVpABimQr6oq5z2WRlxV3Znkk1W1DXBm1/FIkrTQDFIg75/kM8DJNDfqAVBVx895VJJm6+QkLwCOr6rqOhhJkhaSQQrkPYGHAWvwlyEWBVggS6Pnn4E3ArcnuZnmBr2qqnt3G5YkSaNvkAL5b6tqy6FFImnOVNV6XccgSdJCNUiBfGqSrarqvKFFI2lWkqwOrFVVN7TLjwfu2W7+aVVd31lwkiQtEIMUyI8HzkpyEc0Y5LGPbKed5q1ttFcAl1fVTjOKVFI/PgT8Hvj3dvlLwM+BNWlu2HvrdCcwX6WFw3yVhmOQAvmZs7jO3sD5gOMfpeF6KvC3PcvXVtXOSQL8sM9zmK/SwmG+SkPQ96Omq+oS4DLgNpqb88ZeU0qyKfBs4DMzjFFS/1arqtt7lt8KzUc9wLrTHWy+St1Kct8kfT2Ay3yVhqfvAjnJ64HfAScBJ7avr/dx6EeBtzDJw0WS7JVkRZIVV111Vb/hSJrYPZPcdYNeVX0bIMn6NMMspvNRpsjX9lzmrDSHkixPcu8kG9AMhTokyUf6OPSjmK/SUPRdINN8jLNlVT28qh7Zvqb8KzfJTsDvq+qMyfapqoOrallVLdtoo40GCEfSBA4Bjk6yZGxFks1pxiJP2cvUT76COSsNwfpVdR2wC3BEVW0L7DDVAearNFyDjEG+FPjTgOd/EvCcJM+i6b26d5IvVNVLBzyPpD5U1UeS3ASckmQdmptprwc+WFWfmuZw81Xqxj2S/DXwIuAdfR5jvkpDNEiBfCGwPMmJrPwkvUk/Bqqq/YD9AJJsB+xr8krDVVX/Dfz32FCLfqd2M1+lzrwH+Bbwo6o6PcmDgF9OdYD5Kg3XIAXyb9rXPfnLvKqSRlCS+wMfADYGdkyyFfCEqvpst5FJGq+qjgWO7Vm+EHhBdxFJ6rtArqp3AyRZt12+YZALVdVyYPkgx0iascOAQ/nLx7W/AI4G+iqQzVdp/rSzURxEM2wCmikZ966qy/o53nyV5t4gs1g8IslPgXOBc5OckeThwwtN0ixsWFXH0N7d3k79dke3IUmaxKHACTSf+GwMfK1dJ6kjg8xicTDwxqravKo2B95Ec8e8pNFzY5L70c5V3j5yetCbbCXNj42q6tCqur19HQY45YTUoUHGIK9TVd8bW6iq5e1d8pJGzxtpeqS2SPIjmsb2hd2GJGkS
1yR5Kc10jAC7A9d0GI+06A00i0WS/xf4fLv8UpqZLSSNmKo6M8lTgC1ppnq7oKpu6zgsSRN7Jc0Y5P+k+dTnVGDPTiOSFrlBhli8kqYX6njgy8CG7TpJIybJa4F1q+rcqvo5sG6S13Qdl6S7q6pLquo5VbVRVf1VVT2vqn4ztj3Jfl3GJy1GfRXISVYHjq+qN1TVY6rqsVW1T1X9ccjxSZqZV1fVtWMLba6+urtwJM3Crl0HIC02fRXIVXUHcGeS9Yccj6S5sXqSjC20f+Q6f7m0MGX6XSTNpUHGIN8AnJPkJODGsZVV9YY5j0rSbH0TODrJp9vlf27XSVp4qusApMVmkAL5+PYlafS9laYo/td2+STgM92FI2kW7EGW5tkgT9I7fJiBSJo7VXUn8Kn2JWmEJdmgqv4wbt0Dq+qidvHYCQ6TNETTFshJjqmqFyU5hwk+5qmqrYcSmaSBma/SgvS1JDtW1XUASbYCjgEeAVBVH+gyOGkx6qcHee/2607DDETSnDBfpYXnAzRF8rNp5i4/Atij25CkxW3aArmqrmzfvgA4qqquGG5IkmbKfJUWnqo6MckawLeB9YDnV9UvOg5LWtQGuUlvPeCkJH8AjgaOrarfDScsSbNkvkojLslBrDwUan3g18DrkjhLlNShQW7Sezfw7iRbA7sB309yWVXtMLToJM2I+SotCCvGLZ/RSRSS7maQHuQxvwd+C1wD/NXchiNpjpmv0ogamx0qyTrAze1DucYe7HOvLmOTFru+nqQHkOQ1SZYDJwP3o3mUrXfESyPIfJUWlJOBtXqW1wK+01EskhisB3kzYJ+qOmtIsUiaO+artHCsWVU3jC1U1Q1J1u4yIGmx67sHuar2o3nU9MZJloy9hhibpBlq83XdJHsCJNkoyQM7DkvSxG5M8pixhSSPBf7cYTzSotd3D3KS1wEHAL8D7mxXF+DHttKISbI/sIxmTtVDgTWALwBP6jIuSRPaBzg2yRU0j5V+AM3NtZI6MsgQi32ALavqmiHFImnuPB/YBjgToKquSLJetyFJmkhVnZ7kYTR/0AJcUFW3dRmTtNgNUiBfCvxpWIFImlO3VlUlKbjrLnlJIyTJ9lX13SS7jNv00HYe5OM7CUzSQAXyhcDyJCcCt4ytrKqPzHlUkmbrmCSfBu6T5NXAK4FDOo5J0sqeAnwX2HmCbQVYIEsdGaRA/k37umf7kjSiqurDSZ4GXEfzse27quqkjsOS1KOq9m+/7tl1LJJWNuiT9EiydlXdNLyQJM2FtiC2KJZGVJI3TrXdT2il7gwyi8UTgM8C6wJLkjwK+Oeqes2wgpM0mCTX03w0O6Gquvc8hiNpalPdODtpHksavkGGWHwUeAZwAkBV/SzJ3w8jKEkzU1XrASR5L3Al8HmaaaP2AP66w9AkjdPzyezhwN5VdW27fF/gwA5Dkxa9vh8UAlBVl45bdcdU+yfZLMn3kpyX5Nwkew8coaSZeE5V/VdVXV9V11XVp4DnTnWA+Sp1Zuux4higqv5IM03jlMxZaXgGKZAvTfJEoJKskWRf4PxpjrkdeFNVbQU8Hnhtkq1mGKuk/t2YZI8kqydZLckewI3THGO+St1Yre01BiDJBvT3Ca85Kw3JIAXyvwCvBTYBLgce3S5PqqqurKqxBxVcT1NQbzKjSCUN4iXAi2iefPk7YNd23aTMV6kzBwI/TvLednjUqcC/T3eQOSsNzyCzWFxNM45xQkn2q6p/m2L7UpqPjH4ybv1ewF4AS5Ys6TccSVOoqouZYkjFTPO13WbOSnOoqo5IsgLYvl21S1WdN8g5bGOluTXQGORp7DrZhiTrAl8G9qmq63q3VdXBVbWsqpZttNFGcxiOpCnMKF/BnJWGoarOq6pPtK9Bi2PbWGmOzWWBnAlXJmvQJO6RPjZTGhnmq7QKMGel4ZjLAvluczYmCc3cyec74bk0UsxXaYEzZ6XhGXYP8pOAlwHbJzmrfT1rDq8paWbMV2nhM2el
IRnkSXobVNUfxq17YFVd1C4eO/6YqjqFST7KlTQ85qu06jNnpeEZpAf5a0nuekxtO9fi18aWq+oDcxmYpFkxXyVJmqFBCuQP0DS66yZ5LE0P1EuHE5akWTJfJUmaoUHmQT6xvVv228B6wPOr6hdDi0zSjJmvkiTN3LQFcpKDWPmO9/WBXwOvS0JVvWFYwUkajPkqSdLs9dODvGLc8hnDCETSnDBfJUmapWkL5Ko6HCDJOsDNVXVHu7w6cK/hhidpEOarJEmzN8hNeicDa/UsrwV8Z27DkTRHzFdJkmZokAJ5zaq6YWyhfb/23IckaQ6Yr5IkzdAgBfKNSR4zttBOHfXnuQ9J0hwwXyVJmqG+p3kD9gGOTXIFzZN7HgDsNoygJM3aPpivkiTNyCDzIJ+e5GHAlu2qC6rqtuGEJWk2zFdJkmaun3mQt6+q7ybZZdymh7bzqh4/pNgkDch8lSRp9vrpQX4K8F1g5wm2FWCDK40O81WSpFnqZx7k/duvew4/HEmzYb5KkjR7/QyxeONU26vqI3MXjqTZMF8lSZq9foZYrDfFtpqrQCTNCfNVkqRZ6meIxbsBkhwO7F1V17bL9wUOHGp0kgZivkqSNHuDPChk67HGFqCq/ghsM+cRSZoL5qskSTM0SIG8WtsLBUCSDRjsQSOS5o/5KknSDA3SYB4I/DjJse3yrsD75z4kSXPAfJUkaYYGeZLeEUlWANu3q3apqvOGE5ak2TBfJUmauYE+cm0bWBtZaQEwXyVJmplBxiBLkiRJqzwLZEmSJKmHBbIkSZLUwwJZkiRJ6mGBLEmSJPUYeoGc5JlJLkjyqyRvG/b1JM2c+SotLOasNBxDLZCTrA58EtgR2ArYPclWw7ympJkxX6WFxZyVhmfYPciPA35VVRdW1a3AUcBzh3xNSTNjvkoLizkrDcmwC+RNgEt7li9r10kaPeartLCYs9KQDPQkvWFIshewV7t4Q5ILuoxHd9kQuLrrIBaifPgf5/qUm8/1CWfDnB1Z5uwMmK/qiPk6Q/OVs8MukC8HNutZ3rRdd5eqOhg4eMhxaEBJVlTVsq7j0LyaNl/BnB1V5uyiZBu7QJmvo2/YQyxOBx6S5IFJ7gm8GDhhyNeUNDPmq7SwmLPSkAy1B7mqbk/yOuBbwOrA56rq3GFeU9LMmK/SwmLOSsOTquo6Bo2gJHu1H81JWgDMWWnhMF9HnwWyJEmS1MNHTUuSJEk9LJAlSZKkHhbIkiRJUg8LZE0oyRFdxyBJktSFzp+kp+4lGT9vZoB/SHIfgKp6zrwHJWlGkuxZVYd2HYckLWTOYiGSnAmcB3wGKJoC+Us0k85TVd/vLjpJg0jym6pa0nUckhptG3s88KWq+nXX8ag/9iALYBmwN/AO4M1VdVaSP1sYS6MpydmTbQLuP5+xSJrWfYH7AN9L8luaDqijq+qKTqPSlOxB1l2SbAr8J/A74Dn2QkmjKcnvgGcAfxy/CTi1qjae/6gkTSTJmVX1mPb93wG7A7sA59P0KvvAkBFkD7LuUlWXAbsmeTZwXdfxSJrU14F1q+qs8RuSLJ/3aCT1pap+CPwwyeuBpwG7ARbII8geZEmSpCFJclRVvbjrODQYp3mTJEkakqmK4yR7zmcs6p89yJIkSR1w1pnRZQ/yIpDk1AH33y7J14cVj6SpmbPSqiPJ2ZO8zsFZZ0aWN+ktAlX1xK5jkNQ/c1ZapdyfKWadmf9w1A97kBeBJDe0X7dLsjzJcUn+L8mRSdJue2a77kya6WfGjl0nyeeSnJbkp0me267/WJJ3te+fkeQHSfx9kuaAOSutUsZmnblk3OtiYHm3oWky9iAvPtsADweuAH4EPCnJCuAQYHvgV8DRPfu/A/huVb2yffT0aUm+A+wHnJ7kh8DHgWdV1Z3z921Ii4Y5Ky1gVfWqKba9ZD5jUf/sPVh8Tquqy9qG8SxgKfAw4KKq+mU1d21+oWf/pwNvS3IWzV+6awJLquom4NXAScAnfHym
NDTmrCTNM3uQF59bet7fwfS/AwFeUFUXTLDtkcA1gE/tkobHnJWkeWYPsgD+D1iaZIt2efeebd8CXt8z7nGb9uvmwJtoPv7dMcm28xivtNiZs9KIcNaZVZMFsqiqm4G9gBPbG35+37P5vcAawNlJzgXe2za8nwX2raorgFcBn0my5jyHLi1K5qw0Opx1ZtXkg0IkSZJmKMkNVbVuku2AA4CrgUcAZwAvrapK8kzgo8BNwCnAg6pqpyTrAAe1+68BHFBVX03yMeCaqnpPkmfQ3Hy7nTfWzh/HIEuSJM0NZ51ZRTjEQpIkaW4468wqwh5kSZKkueGsM6sIe5AlSZKGx1lnFiALZEmSpCFx1pmFyVksJEmSpB72IEuSJEk9LJAlSZKkHhbIkiRJUg8LZEmSJKmHBbIkSZLUwwJZkiRJ6mGBLEmSJPWwQJYkSZJ6WCBLkiRJPSyQJUmSpB4WyJIkSVIPC2RJkiSphwXyCEryiiSn9CzfkORB0xyzNEklucfwI5Q0lYWaw0n2SPLtrq4v9WOU8yvJk5L8so3pecO8lobLYmoBqKp1u45hriRZClwErFFVt3ccjjQvFkoOV9WRwJFdxyENYsTy6z3AJ6rqY10HMl6SA4AHV9VLu45lIbAHWSPHXnBpZswdaXj6zK/NgXPn6vxJVp/tOTQzFsgdS7JZkuOTXJXkmiSfmGCfSvLg9v1aSQ5MckmSPyU5JclaExzzgiQXJ3nENNd/cpJTk1yb5NIkr2jXr5/kiDauS5K8M8lq7bYDknyh5xwrfXSVZHmS9yb5UZLrk3w7yYbt7j9ov17bfgT1hPbjsh8l+c8k1wDvSfKHJI/sucZfJbkpyUaD/HylYesyh3ty71VJfgN8t13/yiTnJ/ljkm8l2bznmKcnuaC99n8l+X6Sf2q3jf/o+olJTm/3PT3JE3u2TZXn0pxYSPmV5NfAg4Cvte3bvdq29LNJrkxyeZL3pS16J2j7DkhyWJJPJflGkhuBf0iycZIvtz+Di5K8oSfGA5Icl+QLSa4DXjHJ9/JM4O3Abm1sP0uya5Izxu33xiRfbd8fluS/k5zU5vj3x/1f8rB22x/a/1NeNNnPciGyQO5QmyRfBy4BlgKbAEdNc9iHgccCTwQ2AN4C3DnuvHsCHwJ2qKqfT3H9zYH/AQ4CNgIeDZzVbj4IWJ8m2Z8CvBzYs89vDeAl7f5/BdwT2Ldd//ft1/tU1bpV9eN2eVvgQuD+wHtpfg69HwPtDpxcVVcNEIM0VF3ncI+nAH8DPCPJc2kawl1o8vqHwJfa824IHAfsB9wPuKCNY6LvbQPgRODj7b4fAU5Mcr+e3SbLc2nWFlp+VdUWwG+Andv27RbgMOB24MHANsDTgX/qOXdv2/f+dt1L2vfrAacCXwN+1n7/TwX2SfKMnnM8lyav78MkQ6Sq6pvAB4Cj29geBZwAPDDJ3/Ts+jLgiJ7lPWja5A1p6oMjAZKsA5wEfJEm/18M/FeSrSb8CS5EVeWroxfwBOAq4B7j1r8COKVnuWiSazXgz8CjJjjX0na/fYHzgE37uP5+wFcmWL86cCuwVc+6fwaWt+8PAL4wwbXv0S4vB97Zs/01wDcn2rfn+/3NuBi2pfmPJu3yCuBFXf+b+fLV+xqBHB475kE96/4HeFXP8mrATTQf/b4c+HHPtgCXAv80Pm6ahvK0cdf7MfCK9v2kee7L11y8Flp+tcsX0xTe0BS9twBr9ey/O/C9nu9jfNt3GHBEz/K2E+yzH3Bo+/4A4Ad9/jwPoKftbtd9Cnh/+/7hwB+Be/XEclTPvusCdwCbAbsBPxx3rk8D+3f9ezNXL3uQu7UZcEn1f7PahsCawK+n2OfNwCer6rI+rz/RuTYE1qD5q33MJTR/vfbrtz3vb6JJrKlc2rtQVT9pj9suycNo/vM7YYDrS/Oh6xwe05s/mwMfSzNs6lrgDzSF8CbAxr37VtOqTXadjVn5/wC4+/8Dg+a5NIiFll/jbU7Tll7Z
s/+naXpcJzr3ZNfbeOz49hxvpym+pzpHvw4HXpIkNH8UH1NNz/fdzl1VN9B8vxu3cW07Lq49gAfMIpaR4mDubl0KLElyjz7/A7gauBnYgubjlok8Hfhmkt9W1Zf7uP7jJrnObTQJcF67bglwefv+RmDtnv0HSYgaYP3hNMMsfgscV1U3D3AdaT50ncNjevPnUpoeobt91JrkIcCmPcvpXR7nCpr/A3otAb7ZZ0zSbC2o/JrApTQ9yBtOEf9Ebd/4611UVQ/pM76p3G2/qvrfJLcCf0cztOMl43bZbOxNknVphq1c0cb1/ap6Wp/XXnDsQe7WacCVwAeTrJNkzSRPmmznqroT+BzwkXbQ/uppbnK7V89u5wLPBD6Z5DnTXP9IYIckL0pyjyT3S/LoqroDOAZ4f5L12rHKbwTGbsw7C/j7JEuSrE/zcU+/rqIZDzblnJWtLwDPpymSj5hmX6kLXefwRP4b2C/Jw+GuG253bbedCDwyyfPS3FT7Wib/A/cbwEOTvKT9/2E3YCuaMaHSfFho+TU+niuBbwMHJrl3ktWSbJHkKQNc7zTg+iRvTXMD4upJHpHkb2cQ+++ApWlvuO9xBPAJ4LaqOmXctmeluZn/njRjkf+3qi6l+X/goUlelmSN9vW348YzL2gWyB1qC9GdaYYP/Ibmo87dpjlsX+Ac4HSajzo+xLh/x6r6GbATcEiSHae4/m+AZwFvas91FvCodvPraXqKLwROoRmI/7n2uJOAo4GzgTMYoMGsqptobj74UfuxzOOn2PdS4Eyav3p/2O81pPnSdQ5PEtNX2nMeleau9p8DO7bbrgZ2Bf4duIam4F1B08s1/jzXtDG8qd33LcBO7TmkoVto+TWJl9PcwHoezfje44C/HuB6d7SxPprmGQJXA5+huYl+UMe2X69JcmbP+s8Dj+AvnWC9vgjsT/OzfCztzfNVdT1Nb/yLaXqUf0vzc7nXBOdYkMZugJJGUpLPAVdU1Tu7jkVa1bQ9SZcBe1TV97qOR9L8SzMN3u+Bx1TVL3vWHwZctljbX8cga2SleereLjRT40iaA+30UD+hudv/zTQ3GP1vp0FJ6tK/Aqf3FsdyiMUqL8keaSYFH/+a0ZN+5kuS99J8dPUfVXVR1/FIXRlCDj+B5i7/q2k+vn5eVf15zgKWFpCF2kZOJsn/TPL9vH2S/S8G9qYZSqUeDrGQJEmSetiDLEmSJPUYqTHIG264YS1durTrMKSRcsYZZ1xdVRt1HcdEzFlpZeartLBMlrMjVSAvXbqUFStWdB2GNFKSjH+a2cgwZ6WVma/SwjJZzg59iEWS+yQ5Lsn/JTk/yROGfU1JM2O+SpI0Pz3IHwO+WVUvbJ/EsvZ0B0jqjPkqSVr0hlogt48h/nvgFQBVdStw6zCvKWlmzFdJkhrD7kF+IHAVcGiSR9E8lnjvqrpxbIckewF7ASxZsmROL/7YNx8xp+eT+nHGf7y86xBmatp8BXNWq5YFnK9DYx7OnL9Pq45hj0G+B/AY4FNVtQ1wI/C23h2q6uCqWlZVyzbaaCRv/JUWi2nzFcxZSdKqb9gF8mU0z/H+Sbt8HE0DLGn0mK+SJDHkArmqfgtcmmTLdtVTgfOGeU1JM2O+SpLUmI9ZLF4PHNneEX8hsOc8XFPSzJivkqRFb+gFclWdBSwb9nUkzZ75KknSPDwoRJIkSVpILJAlSZKkHhbIkiRJUg8LZEmSOpTkoUlOTvLzdnnrJO/sOi5pMbNAliSpW4cA+wG3AVTV2cCLO41IWuQskCVJ6tbaVXXauHW3dxKJJMACWZKkrl2dZAugAJK8ELiy25CkxW0+HhQiSZIm91rgYOBhSS4HLgL26DYkaXGzQJYkqUNVdSGwQ5J1gNWq6vquY5IWO4dYSJLUoST3S/Jx4IfA8iQfS3K/ruOSFjMLZEmSunUUcBXwAuCF7fujO41IWuQcYiFJUrf+uqre27P8viS7dRaNJHuQJUnq2LeTvDjJau3rRcC3
ug5KWswskCVJ6targS8Ct7Svo4B/TnJ9kuumOjDJ6kl+muTr8xCntGg4xEKSpA5V1XqzOHxv4Hzg3nMUjiTsQZYkqVNJvpzkWUkGapOTbAo8G/jMcCKTFq++kzHJI4cZiCRJi9SnaB4M8sskH0yyZZ/HfRR4C3DnZDsk2SvJiiQrrrrqqtlHKi0Sg/y1+l9JTkvymiTrDy0iSZIWkar6TlXtATwGuBj4TpJTk+yZZI2JjkmyE/D7qjpjmnMfXFXLqmrZRhttNOexS6uqvgvkqvo7mr9wNwPOSPLFJE8bWmSSJC0S7YNBXgH8E/BT4GM0BfNJkxzyJOA5SS6mualv+yRfGH6k0uIw0E16VfXLJO8EVgAfB7ZJEuDtVXX8MAKUNDNJNgE2pyfPq+oH3UUkaSJJvgJsCXwe2Lmqrmw3HZ1kxUTHVNV+wH7t8dsB+1bVS4cfrbQ49F0gJ9ka2JPmhoCTaJL4zCQbAz8GLJClEZHkQ8BuwHnAHe3qAiyQpdFzSFV9o3dFkntV1S1VtayroKTFbJAe5INo7pR9e1X9eWxlVV3R9ipLGh3PA7asqlu6DkTStN4HfGPcuh/TDLGYVlUtB5bPbUjS4tZXgZxkdeDyqvr8RNsnWy+pMxcCa9A8dEDSCEryAGATYK0k2wBpN90bWLuzwCT1VyBX1R1JNktyz6q6ddhBSZq1m4CzkpxMT5FcVW/oLiRJ4zyD5sa8TYED+UuBfB3w9o5iksRgQywuAn6U5ATgxrGVVfWROY9K0myd0L4kjaiqOhw4PMkLqurLk+2X5B/bfSXNk0EK5F+3r9WAscdi1pxHJGnWqurwJPcEHtquuqCqbusyJkkTm6o4bu0NWCBL82iQAvm8qjq2d0WSXec4HklzoJ326XCahw4E2KzthXIWC2nhyfS7SJpLgzxJb78+10nq3oHA06vqKVX19zRjHf+z45gkzYyf1krzbNoe5CQ7As8CNkny8Z5N9wZu7+ci7SwYK2hmwthpJoFKGsgaVXXB2EJV/WKyR9aOZ75KI8ceZGme9dODfAVNY3kzcEbP6wSaXql+7A2cP5MAJc3IiiSfSbJd+zqEJo/7Yb5K8yTJakleNM1uP5qXYCTdZdoe5Kr6GfCzJF+cyU0+STalefre+4E3Dh6ipBn4V+C1wNi0bj8E/mu6g8xXaX5V1Z1J3gIcM8U+r5vHkCQx2E16j0tyALB5e1yAqqoHTXPcR4G38JeZL1aSZC9gL4AlS5YMEI6kybRP0PtI+xrER5kiX8GclYbgO0n2BY5m5WlU/9BdSNLiNkiB/Fng/6EZXnFHPwck2Qn4fVWd0d5VfzdVdTBwMMCyZcu8EUGahSTHVNWLkpzDBDf2VNXWUxw7bb625zBnpbm1W/v1tT3rCpiuA0rSkAxSIP+pqv5nwPM/CXhOkmcBawL3TvKFqnrpgOeR1J+9268zubnOfJU6UFUP7DoGSSsbpED+XpL/AI5n5UfXnjnZAVW1H+1UcG2P1L42ttLwVNWV7ddLZnCs+Sp1IMnaNGP+l1TVXkkeAmxZVV/vODRp0RqkQN62/bqsZ10B289dOJLmQpLrufsQiz/RzGTxpqq6cP6jkjSJQ2mGLz6xXb4cOBawQJY60neBXFX/MJsLVdVyYPlsziGpbx8FLgO+SHND7YuBLYAzgc8B2011sPkqzastqmq3JLsDVNVNSZz7WOpQ3wVykndNtL6q3jN34UiaI8+pqkf1LB+c5KyqemuSt3cWlaSJ3JpkLdpPfZJsQc9QRknzb5BHTd/Y87oD2BFYOoSYJM3eTUle1D6EYOxBBDe325x5Qhot+wPfBDZLciRwMs10i5I6MsgQiwN7l5N8GPjWnEckaS7sAXyM5uEgBfwv8NK2l8qHDkgjIslqwH2BXYDH0wyJ2ruqru40MGmRG+QmvfHWBjadq0AkzZ32JrydJ9l8ynzGImlyY0/Sq6pjgBO7jkdSo+8hFknOSXJ2+zoXuIDmRiBJIybJQ5Oc
nOTn7fLWSd7ZdVySJvSdJPsm2SzJBmOvroOSFrNBepB7HzxwO/C7qrp9juORNDcOAd4MfBqgqs5O8kXgfZ1GJWkiPklPGjGDjEG+JMmjgL9rV/0AOHsoUUmarbWr6rRxM0X5B600YtoxyG+rqqO7jkXSXwwyxGJv4Ejgr9rXkUleP6zAJM3K1e1UUWPTRr0QuLLbkCSNV1V30nzaI2mEDDLE4lXAtlV1I0CSDwE/Bg4aRmCSZuW1wMHAw5JcDlxEM7OFpNHznST7AkfTTKUKQFX9obuQpMVtkAI5NPMfj7mjXSdphCRZHXhNVe2QZB1gtaq6vuu4JE3KMcjSiBmkQD4U+EmSr7TLzwM+O+cRSZqVqrojyZPb9zdOt7+kblXVA7uOQdLKBrlJ7yNJlgNPblftWVU/HUpUkmbrp0lOAI5l5Y9sj+8uJEkTSbI28EZgSVXtleQhwJZV9fWOQ5MWrb4L5CSPB86tqjPb5Xsn2baqfjK06CTN1JrANcD2PesKsECWRs+hwBnAE9vly2n+uLVAljoyyBCLTwGP6Vm+YYJ1kkZAVe051fYk+1XVv81XPJKmtEVV7ZZkd4Cquinj5miUNL/6nuYNSFXV2EI7Nc1sHlUtqTu7dh2ApLvcmmQt/jIt4xbALd2GJC1ugxTIFyZ5Q5I12tfewIXDCkzSUNk7JY2O/YFvApslORI4GXhLtyFJi9sgBfK/0IyPuhy4DNgW2GsYQUkaupp+F0nDlORJ7dsfALsArwC+BCyrquUdhSWJwWax+D3w4sm2O6ZRWlDsQZa693HgscCPq+oxwIkdxyOpNZdjiHcFLJClEZBkg/FP4UrywKq6qF08toOwJK3stiQHA5sm+fj4jVX1hg5iksTcFsj2SEmj42tJdqyq6wCSbAUcAzwCoKo+0GVwkgDYCdgBeAbNNG+SRsRcFsiOaZRGxwdoiuRnA1sCRwB7dBuSpF5VdTVwVJLzq+pngx6fZDOa3L4/TRt8cFV9bI7DlBYle5ClVVBVnZhkDeDbwHrA86vqFx2HJWlif05yMnD/qnpEkq2B51TV+6Y57nbgTVV1ZpL1gDOSnFRV5w09YmkV1/csFkk2mGBd7/PjHdModSzJQUk+3o5n3B5YH7gIeN1EYxwljYRDgP2A2wCq6mymuCl+TFVdOfZ026q6Hjgf2GSIcUqLxiA9yI5plEbfinHLjmuURt/aVXXauIfn3T7ICZIsBbYBfjJu/V60U7IuWbJkdlFKi8ggBbJjGqURV1WHAyRZB7i5qu5ol1cH7tVlbJImdXX79LyxJ+m9ELiy34OTrAt8GdhnrBNrTFUdDBwMsGzZMu8Vkvo0yDzIjmmUFo6Tae6Ov6FdXosmd5/YWUSSJvNamiL2YUkupxkW1VcHVNsufxk4sqqOH16I0uIybYGc5CBWnqFifeDXNGManadRGk1rVtVYcUxV3ZBk7S4DknR37ac7r6mqHdpPflZrxxP3c2yAzwLnV9VHhhmntNj004M84zGNTkEjdebGJI8Zu4EnyWOBP091gPkqzb+quiPJk9v3Nw54+JOAlwHnJDmrXff2qvrGHIYoLUrTFsizHNPoFDRSN/YBjk1yBc0UjA8AdpvmGPNV6sZPk5xAMxvUXUXydEMmquoUnGJVGopBbtIbeExjVV1Je6NBVV2fZGwKGhtcaYiq6vQkD6O5oRbggqq6bZpjzFepG2sC19BMzTimAMcUSx0ZpECe1ZhGp6CRhi/J9lX13SS7jNv00Paegb4a3Mnytd1mzkpzqKr2nGp7kv2q6t/mKx5JAzwohHZM49hCP2Mae/adcgqaqlpWVcs22mijAcKRNIGntF93nuC1Uz8nmCpfwZyVOrBr1wFIi80gPcj7MPiYRqegkeZRVe3ffp2yR2oy5qs0khxnLM2zQeZBHnhMo1PQSPMryRun2j5VHpqv0sjyAR/SPOtnHuTZjGl0Chppfq03xbbpGlnzVRpN9iBL86yfHuSnAN+lGcM43pR3
2ToFjTS/qurdAEkOB/auqmvb5fsCB05zrPkqdSDJBlX1h3HrHlhVF7WLx3YQlrSo9TMP8qzGNErqxNZjxTFAVf0xyTYdxiNpcl9LsuPYTbFJtgKOAR4BUFUf6DI4aTHqZ4jFjMc0SurMaknuW1V/hKaHisFuypU0fz5AUyQ/m+Y+nyOAPboNSVrc+mkwZzOmUVI3DgR+nGTso9ldgfd3GI+kSVTVie0MMt+maXOfX1W/6DgsaVHrZ4jFjMc0SupGVR2RZAV/eTLXLj4yWhotSQ5i5Y6m9YFfA69rb4J/QzeRSRrkI1fHNEoLSFsQWxRLo2vFuOUzOolC0t0MUiA7plGSpDlSVYcDJFkHuLmq7miXVwfu1WVs0mI3SIHrmEZJkubeycAOwA3t8lo045Gf2FlE0iI3yJP0HNMoSdLcW7OqxopjquqGJGt3GZC02A00RMIxjZIkzbkbkzymqs4ESPJY4M8dx6QR8Jv3PLLrEBasJe86Z1bHO4ZYkqRu7QMcm+QKmqdZPgDYrdOIpEXOAlmSpA5V1elJHkbzkBCAC6rqti5jkhY7C2RJkjqQZPuq+m6SXcZtemg7D/LxnQQmyQJZkqSOPAX4LrDzBNsKsECWOmKBLElSB6pq//brnl3HImllFsiSJHUgyRun2l5VH5mvWCStzAJZkqRurDfFtpq3KCTdjQWyJEkdqKp3AyQ5HNi7qq5tl+9L8/RaSR1ZresAJEla5LYeK44BquqPwDbdhSPJAlmSpG6t1vYaA5BkA/yEV+qUCShJUrcOBH6c5Nh2eVfg/R3GIy16FsiSJHWoqo5IsgLYvl21S1Wd12VM0mJngSxJUsfagtiiWBoRjkGWJEmSelggS5IkST0skCVJkqQeFsiSJElSDwtkSZIkqcfQC+Qkz0xyQZJfJXnbsK8naebMV2lhMWel4RhqgZxkdeCTwI7AVsDuSbYa5jUlzYz5Ki0s5qw0PMPuQX4c8KuqurCqbgWOAp475GtKmhnzVVpYzFlpSIb9oJBNgEt7li8Dtu3dIclewF7t4g1JLhhyTOrPhsDVXQexEOXD/zjXp9x8rk84iWnzFczZEWbOzsACzldYvG3syP6uD+H3aZSN7L8DAPun3z0nzNnOn6RXVQcDB3cdh1aWZEVVLes6Do0ec3Y0mbOayKqYr/6uj4ZV/d9h2EMsLgc261netF0nafSYr9LCYs5KQzLsAvl04CFJHpjknsCLgROGfE1JM2O+SguLOSsNyVCHWFTV7UleB3wLWB34XFWdO8xras6sUh/JaXrm64Jnzi4yizhn/V0fDav0v0OqqusYJEmSpJHhk/QkSZKkHhbIkiRJUg8LZEmSJKmHBbIkSZKmlORhSZ6aZN1x65/ZVUzDZIGsKSXZs+sYJEkaz/Zp/iR5A/BV4PXAz5P0PtL8A91ENVzOYqEpJflNVS3pOg5JknrZPs2fJOcAT6iqG5IsBY4DPl9VH0vy06raptsI517nj5pW95KcPdkm4P7zGYuk6SV5ALA/cCfwLppenRcA5wN7V9WVHYYnzRnbp5GxWlXdAFBVFyfZDjguyeY0/xarHAtkQfOfzDOAP45bH+DU+Q9H0jQOA04E1gG+BxwJPAt4HvDfwHMnO1BaYGyfRsPvkjy6qs4CaHuSdwI+Bzyy08iGxAJZAF8H1h37xe+VZPm8RyNpOvevqoMAkrymqj7Urj8oyas6jEuaa7ZPo+HlwO29K6rqduDlST7dTUjD5RhkSVpgkvysqh7Vvn9fVb2zZ9s5VbVK9uhI0nxxFgtJWni+OjbV0rji+MHABZ1FJUmrCHuQJWkVkmTPqjq06zgkaSGzB3kRSDLQjQxJtkvy9WHFI2mo3t11ANJiYhu7avImvUWgqp7YdQyS5o5TX0mjwzZ21WQP8iKQ5Ib263ZJlic5Lsn/JTkySdptz2zXnQns0nPsOkk+l+S0JD8de3pOko8leVf7/hlJfpDE3ydpftyf5q7ynSd4XdNhXNKiYxu7
arIHefHZBng4cAXwI+BJSVYAhwDbA78Cju7Z/x3Ad6vqlUnuA5yW5DvAfsDpSX4IfBx4VlXdOX/fhrSoOfWVNJpsY1cR/jWy+JxWVZe1iXYWsBR4GHBRVf2ymrs2v9Cz/9OBtyU5C1gOrAksqaqbgFcDJwGfqKpfz9t3IC1yVfWqqjplkm0vme94JN3FNnYVYQ/y4nNLz/s7mP53IMALqmqiqaMeSfNx7sZzFJskSQuZbewqwh5kAfwfsDTJFu3y7j3bvgW8vmcc1Tbt182BN9F8nLRjkm3nMV5JkhYK29gFyAJZVNXNwF7Aie0NBL/v2fxeYA3g7CTnAu9tE/mzwL5VdQXwKuAzSdac59ClVZLTRkmrDtvYhckHhUjSApdkO5rGdKeOQ5GkVYI9yJI0Ypw2SpK65U16kjTanDZKkuaZvQeSNNqcNkqS5pk9yJI02pw2SpLmmT3IkrTwOG2UJA2RBbIkLTBOGyVJw+U0b5IkSVIPe5AlSZKkHhbIkiRJUg8LZEmSJKmHBbIkSZLUwwJZkiRJ6mGBLEmSJPWwQJYkSZJ6/P+uPcmCOeO9dgAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:39:55.476626Z",
+ "start_time": "2020-11-13T15:39:48.764592Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAtIAAAFgCAYAAACWgJ5JAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAABB/UlEQVR4nO3dd7gsVZn+/e9NkCwSjijhcBQVBhREz4AoMyJgAEEURUBMiDJm+CkG1FcQ1NEZUREcESQqShJGFFQQREAUOCBBYJAsSXJOyuF+/6i1pWl26O7d3dXd+/5cV127K3TVs/ucZ9fTq1atkm0iIiIiIqI9C9QdQERERETEMEohHRERERHRgRTSEREREREdSCEdEREREdGBFNIRERERER1IIR0RERER0YEU0hERHZD0XklnN8w/KOn5U7xnjiRLWmiax75e0qbT2UfZz5QxR0TExFJID5A6T8zdkhNzzFS2l7R9bd1xtKNbMUt6raTfSnpA0l2SLpL0GUmLdiPOiGE3Cuf3GF8K6QE2k0/MdZK0kaSb6o4jYhhI2gY4DvgxsKrt5YBtgZWBVSZ4TwqDmNFG4VwZlRTS0TeSFqw7hohOSFpF0vGS7igtrvuPs40lvaC8XkzSPpJukHSfpLMlLTbOe95aumm8eIrjv6vs6y5Jn29at4Ckz0q6pqw/RtKyZd0vJX20afuLJW3dTsySXiHpHEn3lvdvVJYL+Cawl+2DbN8NYPtK2x+zfVXZbk9Jx0n6kaT7gfdKWlHSiZLulnS1pA80xHiYpC83zD/ly235zHaXdLmkeyQdmtbviKhDCumaDMCJedwTY1l3hqS9Jf2+XKo9RdLyZV07J+bDJH1P0smSHgJeI+lfyv7vlXSZpDc17OcwSd+VdFI57rmSVmv6PD4s6aqyfm9Jq5Xf4/5SQDyjYfstVF1ivrdss3bDuusl7SbpkvJ5Hi1pUUlLAL8EVlR16e1BSStO9lnGaFP1BfAXwA3AHGAl4Kgp3vYN4OXAK4FlgU8DTzTtd0fg68Cmtv88yfHXBL4HvAtYEViOqrV3zMeANwOvLuvvAb5b1v0E2L5pX6sCJ7Uas6SVyvZfLst3A34qaRaweonlp5N8FmO2omq5fhZwJNVneFOJ+W3AVyVt3MJ+xuwAvB5YDXgR8IU23hvRMwNwfn9TOb/eW863/9Kw7jOSbi7n0CslbdKN33lGs52pzxOwIHAx8C1gCWBRYEPgvcDZDdsZeEF5/V3gDKqT+IJUJ7tFqE7sBhYCdgSuHnvPJMdfCbgL2Jzqy9Rry/yssv4M4Bqqk9NiZf5rZd27gd837GtN4F5gkXFiPgy4D3hVOc5SJb7PAc8ANgYeAFZv2P4uYL3y+xwJHNX0efwMeCawFvAYcBrwfGBp4HLgPWXbdYHbgfXL5/Ue4PqGOK8HzqM6iS8LXAF8sKzbCLip7v8nmQZjAjYA7gAWalo+br6W/+uPAOuMs6+xfN2t/H9duYXjf7EpD5YA/k5VgFP+727SsP65wD9KDi0FPETV5QLgK8Ahbcb8GeCHTct+XXJqw7KPRRvWHVX+JjwMvKss2xM4s2GbVYD5wFINy/4TOKy8Pgz4csO6p+Rkyd8PNsxvDlxT9/+VTJmo//z+opLzrwUWpvpCfDXVOXd14EZgxbLtHGC1uj+zYZ/SIl2P9agKuE/Zfsj2o7bPnmhjSQsA7wN2sX2z7fm2z7H9WMNmuwKfAjayffUUx38ncLLtk20/YftUYB7VyWjMobb/YvsR4BjgpWX5CcBLJa1a5ncAjm+KpdHPbP/e9hNlH0tSFeV/t306VUvf9g3bn2D7PNuPUxXSL23a33/Zvt/2ZcCfgVNsX2v7PqqW5HXLdjsD37d9bvm8DqcqvF/RsK/v2L7F1eXon49zrAioir4byv/JVixPdfK8ZpJtPgV813YrffFXpDr5AWD7IaovnGNWBU4orU/3UhXW84EVbD9A1Zq8Xdl2e6q8aifm
VYFtxvZfjrEhVcE+FsdzG+LbzvazgAupioIxNza8XhG4u8Q35gaqQqJVjfu7oewzom51n9+3BU6yfartf1BdaVqMqjifT1WgrylpYdvX257s71S0IIV0Peo+MU92Yhzzt4bXD1MVwLRxYh7TfPK8sRTVY5pPnuMet8FtDa8fGWd+bPtVgU82/Y6r8NST7VTHioDq//BstX6D3J3Ao1RdDibyOuALkt7awv5upeGmPUmLU3XvaIxvM9vPapgWtX1zWf8TYHtJG1D9HfltmzHfSNUi3bj/JWx/DbgSuBnYuoXfww2vbwGWlbRUw7LZZV9Qtagt3rDuOePsr/FGxtllnxF1q/v8viLVeRWAcr69EVipFOG7Ul0hul3SUem6OH0ppOtR94l5shNjK1o5MY9pPnmuUr6Bj2k8eXbTjcBXmn7HxW3/pIX3eupNYgY5j6qY/ZqkJUpf+ldNtHE5cR0CfFPVDXULStpA0iINm10GvAH4buN9AhM4DthC0oblHoC9eOrf7gOAr4xdJZI0S9JWDetPpvpiuRdwdNMX2VZi/hGwpaTXl+WLqrr5b+Xyvk8Ce0j6gKRlVHkhsMIkn9GNwDnAf5b9rQ3sVI4FcBGwuaRlJT2H6uTf7COSVlZ1Y+XngaMn+xAj+qTu8/stVPkO/POG4FUo51nbP7a9YdnGVPdpxDSkkK5H3SfmCU+MLcY/5Yl5AudStfx+WtLCqm5w3JKpb9zqxEHAByWtX07sS0h6Y1ML2ERuA5aTtHQP4oohY3s+1f/TFwB/pbpBbtsp3rYbcClwPnA31cnqKX9vbV8MbAEcJGmzSY5/GfARquHlbqW6mbCxZWpf4ETgFEkPAH+kujdg7P2PAccDm5Z9tBVzKXq3orq34Q6qQuFTY7+P7aOBt1N1GbuRqjA4BjgQOHaS421P1UfzFqouY3vY/k1Z90OqfqbXA6cwfpH847LuWqrWvC+Ps01Ev9V9fj8GeKOkTSQtTPVF9zHgHEmrS9q47PtRqqu4rZ6/YyJ1dc6e6RNVS+z/UvUxvBP4DpPfjLAY8G2qb5X3AWeWZXPKdguV7eZSFYKbTXH89YHfUZ0w76DqrjG7rDsDeH/Dtk+Jqyw7uBz3X5uWN99s+OWm9WuV495HdbPVWxrWPWV7nn6D0T/3XebPBt7bMP9l4AcN82+gKgrupfrDdizl5iaqE/SmDdvuCfyoYf6Q8m9zL+XGjEyZMg3G1Jy/mTIN0jQA5/e3lPPrfeV8u1ZZvjZVof9AOff/Iue36U8qH25ERMRQkHQ91Zf930y1bUREL6VrR0REzSTtoCfHLW+cLqs7toiImFhapEeUpB2A74+z6gbba/U7noiIiJi+nN8HSwrpiIiIiIgOtDo8y1BYfvnlPWfOnLrDiOi7Cy644E7bs+qOo13J2ZiphjFnk68xU02WryNVSM+ZM4d58+bVHUZE30m6YeqtBk9yNmaqYczZ5GvMVJPla242jIiIiIjoQArpiIiIiIgOpJCOiIiIiOhACumIiIiIiA6M1M2G3fTyTx1Rdwgx4i7473fXHcJISc5GryVnuyf5Gr3Wr3xNi3RERERERAdSSEdEREREdCCFdEREREREB1JIR0RERER0IIV0REREREQHUkhHRERERHSgp4W0pEMk3S7pzw3LlpV0qqSrys9lJnjve8o2V0l6Ty/jjIhKcjZieCRfI+rX6xbpw4A3NC37LHCa7RcCp5X5p5C0LLAHsD6wHrDHRH8MIqKrDiM5GzEsDiP5GlGrlgtpST9sZVkj22cCdzct3go4vLw+HHjzOG99PXCq7btt3wOcytP/WETEFCQ9U9JSrW6fnI0YHsnXiPq10yK9VuOMpAWBl3dwzBVs31pe/w1YYZxtVgJubJi/qSx7Gkk7S5onad4dd9zRQTgRo0fSv0q6FLgE+LOkiyV1kq+QnI0YJsnXiD6aspCWtLukB4C1Jd1fpgeA24GfTefgtg14mvs40PZc23NnzZo1nV1FjJKD
gQ/bnmN7VeAjwKHT3WlyNmJ4JF8jem/KQtr2f9peCvhv288s01K2l7O9ewfHvE3ScwHKz9vH2eZmYJWG+ZXLsohozXzbZ43N2D4beLzDfSVnI4ZH8jWij1ru2mF7d0krSXqlpH8fmzo45onA2B3C72H8Vu1fA6+TtEy5AeJ1ZVlEtOZ3kr4vaSNJr5b0P8AZkl4m6WVt7is5GzE8kq8RfbRQqxtK+hqwHXA5ML8sNnDmJO/5CbARsLykm6juEv4acIyknYAbgLeXbecCH7T9ftt3S9obOL/sai/bzTdURMTE1ik/92havi5V3m483puSsxHDI/kaUb+WC2ngLcDqth9r9Q22t59g1SbjbDsPeH/D/CHAIW3EFxGF7dd0+L7kbMSQSL5G1K+dQvpaYGGg5UI6Iuoh6YvjLbe9V79jiYiIGFXtFNIPAxdJOo2GYtr2x7seVURM10MNrxcFtgCuqCmWiIiIkdROIX1imSJiwNnep3Fe0jfIzUQRERFd1XIhbftwSYsBs21f2cOYIqL7Fqca4ioiIiK6pJ1HhG8JXAT8qsy/VFJaqCMGkKRLJV1SpsuAK4Fv1xxWRETESGmna8eewHrAGQC2L5L0/B7EFBHTt0XD68eB22x3+kCWiIiIGEfLLdLAP2zf17TsiW4GExHdYfsG4FnAllRDV65Za0AREREjqJ1C+jJJ7wAWlPRCSfsB5/QoroiYBkm7AEcCzy7TkZI+Vm9UERERo6WdQvpjwFpUQ9/9BLgf2LUHMUXE9O0ErG/7i7a/CLwC+EDNMUVERIyUdkbteBj4fJkiYrAJmN8wP78si4iIiC5puZCWNBf4HDCn8X221+5+WBExTYcC50o6ocy/GTi4vnAiIiJGTzujdhwJfAq4lNxkGDGwJC0A/JFqhJ0Ny+Idbf+ptqAiIiJGUDuF9B22M250xICz/YSk79peF7iw7ngiIiJGVTuF9B6SfgCcRnXDIQC2j+96VBExXadJeitwvG3XHUxERMQoaqeQ3hFYA1iYJ7t2GEghHTF4/gP4BPC4pEepbjS07WfWG1ZERMToaKeQ/lfbq3fjoJJWB45uWPR84Iu2v92wzUbAz4DryqLjbe/VjeNHjDrbS3VrX8nXiOGSnI3on3YK6XMkrWn78uke1PaVwEsBJC0I3AycMM6mZ9neYpzlETGOkk+L2X6wzL8CeEZZ/SfbD7S7z+RrxHBJzkb0TzuF9CuAiyRdR9VHeuxS8XSHv9sEuKY80jgipufrwO3Af5X5nwB/BhaluvHwM9Pcf/I1YrgkZyN6qJ1C+g09imE7qpP9eDaQdDFwC7Cb7cuaN5C0M7AzwOzZs3sUYsTQ2AT414b5e21vKUnAWV3Y/7TyFZKzEX2Wc2xED7X8iPDybfYm4B9UNxmOTR2T9AzgTcCx46y+EFjV9jrAfsD/ThDXgbbn2p47a9as6YQTMQoWsP14w/xnoLp0BCw5nR13I19LLMnZiAlIWkZSVx50lnNsRO+1XEhL+hhwG3AqcFKZfjHN428GXGj7tuYVtu8f6+dp+2RgYUnLT/N4EaPuGZL+eaOh7VMAJC1N1b1jOpKvET0g6QxJz5S0LFWBe5Ckb3Zh18nZiB5ruZAGdgFWt72W7ZeUabrfmrdngktOkp5TLkcjab0S613TPF7EqDsIOFrSP6/BSlqVKs9+MM19J18jemNp2/cDWwNH2F4f2LQL+03ORvRYO32kbwTu69aBJS0BvJZqvNuxZR8EsH0A8DbgQ5IeBx4BtsuDJSImZ/ubkh4Gzi45JuAB4Gu2v9fpfpOvET21kKTnAm8HPt+NHSZnI/qjnUL6WuAMSSfx1CcbdnT5yfZDwHJNyw5oeL0/sH8n+46YyUoeHTDWxaOTIe/G2WfyNaJ39gJ+Dfze9vmSng9cNZ0dJmcj+qOdQvqvZXoGT45LGxEDSNIKwFeBFYHNJK0JbGD74Hoji4hmto+l4YZA29cCb60voohoVcuFtO0vAUhassw/2Kug
ImLaDgMO5cnLxH+hetJZCumIASNpZaqRM15VFp0F7GL7pvqiiohWtDNqx4sl/Qm4DLhM0gWS1updaBExDcvbPgZ4AqAMiTe/3pAiYgKHAidSXUFaEfh5WRYRA66dUTsOBD5he1XbqwKfpBohICIGz0OSlqOM9V4eFd61m4Ujoqtm2T7U9uNlOgzIoM0RQ6CdPtJL2P7t2IztM8pdwRExeD5B1cK1mqTfU52U31ZvSBExgbskvZMnh6rbngxFFzEU2hq1Q9L/B/ywzL+TaiSPiBgwti+U9Gpgdaoh8K60/Y+aw4qI8b2Pqo/0t6iuIp0D7FhrRBHRkna6dryPqlXreOCnwPJlWUQMGEkfAZa0fZntPwNLSvpw3XFFxNPZvsH2m2zPsv1s22+2/dex9ZJ2rzO+iJhYS4W0pAWB421/3PbLbL/c9q627+lxfBHRmQ/YvndspuTqB+oLJyKmYZu6A4iI8bVUSNueDzwhaekexxMR3bHg2ON/4Z9fhjP+e8Rw0tSbREQd2ukj/SBwqaRTgYfGFtr+eNejiojp+hVwtKTvl/n/KMsiYvjk0d0RA6qdQvr4MkXE4PsMVfH8oTJ/KvCD+sKJiGlIi3TEgGrnyYaH9zKQiOge208A3ytTRAwwScvavrtp2fNsX1dmjx3nbRExAKYspCUdY/vtki5lnMtLttfuSWQR0bbka8RQ+rmkzWzfDyBpTeAY4MUAtr9aZ3ARMbFWWqR3KT+36GUgEdEVydeI4fNVqmL6jVRjvx8B7FBvSBHRiikLadu3lpdvBY6yfUtvQ4qITiVfI4aP7ZMkLQycAiwFvMX2X2oOKyJa0M7NhksBp0q6GzgaONb2bb0JKyKmKfkaMeAk7cdTu2AtDVwDfFRSRsWKGALt3Gz4JeBLktYGtgV+J+km25t2cmBJ1wMPAPOBx23PbVovYF9gc+Bh4L22L+zkWBEzTbfzFZKzET0wr2n+gm7tOPka0R/ttEiPuR34G3AX8OxpHv81tu+cYN1mwAvLtD7V6APrT/N4ETNNN/MVkrMRXTM2GpakJYBHy8PPxh6gtEgXDpF8jeixlp5sCCDpw5LOAE4DlqN6BHEvRwDYCjjClT8Cz5L03B4eL2Jk1JCvkJyN6NRpwGIN84sBv+nxMZOvEV3QTov0KsCuti/q0rENnCLJwPdtH9i0fiXgxob5m8qyWxs3krQzsDPA7NmzuxRaxNDrdr5CcjaiVxa1/eDYjO0HJS0+zX0mXyP6oOUWadu7Uz0ifEVJs8emaRx7Q9svo7q89BFJ/97JTmwfaHuu7bmzZs2aRjgRo6Pk65KSdgSQNEvS86a52+RsRG88JOllYzOSXg48Ms19Jl8j+qDlFmlJHwX2BG4DniiLDXR0udj2zeXn7ZJOANYDzmzY5GaqVrUxK5dlETEFSXsAc6nGpD0UWBj4EfCqTveZnI3omV2BYyXdQvU48OdQ3STcseRrRH+03CJNleir217L9kvK1FERLWkJSUuNvQZeB/y5abMTgXer8grgvoYxciNicm8B3gQ8BFDGk16q050lZyN6x/b5wBrAh4APAv9iu+MRPJKvEf3TTh/pG4H7unTcFYATqtF3WAj4se1fSfoggO0DgJOphuW5mmponh27dOyImeDvtl36R46dTKcjORvRZZI2tn26pK2bVr2ojCN9fIe7Tr5G9Ek7hfS1wBmSTgIeG1to+5vtHtT2tcA64yw/oOG1gY+0u++IAOAYSd+nuhP/A8D7gIM63VlyNqInXg2cDmw5zjoDHRXSydeI/mmnkP5rmZ5RpogYULa/Iem1wP1U/aS/aPvUmsOKiAa29yg/0xocMaTafbIhkha3/XDvQoqIbiiFc4rniAEl6ROTre/kim9E9Fc7o3ZsABwMLAnMlrQO8B+2P9yr4CKiPZIeoLokPC7bz+xjOBExucluAJ4wjyNicLTTtePbwOup7vTF9sWdjksZEb1he+xO/b2pHqzwQ6rhtHYA8tSyiAHScKX3
cGAX2/eW+WWAfWoMLSJa1M7wd9i+sWnR/C7GEhHd8ybb/2P7Adv32/4e1SOBI2LwrD1WRAPYvgdYt75wIqJV7RTSN0p6JWBJC0vaDbiiR3FFxPQ8JGkHSQtKWkDSDpQxpSNi4CxQWqEBkLQs7V0xjoiatJOoHwT2BVaievrRKWTonIhB9Q6qfN2Xqq/l78uyiBg8+wB/kHRsmd8G+EqN8UREi9oZteNOqn6W45K0u+3/7EpUETEttq9nkq4cydeIwWH7CEnzgI3Loq1tX15nTBHRmrb6SE9hmy7uKyJ6K/kaMUBsX257/zKliI4YEt0spNXFfUVEbyVfIyIipqmbhXTGvIwYHsnXiIiIaUqLdMTMlHyNiIiYppYL6TIcT/Oy5zXMHtu8PiLqkXyNiIjovXZapH8u6Z+PF5a0JvDzsXnbX+1mYBExLcnXiIiIHmunkP4q1cl5SUkvp2rRemdvwoqIaUq+RkRE9Fg740ifJGlhqgexLAW8xfZfehZZRHQs+RoREdF7UxbSkvbjqXf4Lw1cA3xUErY/3u5BJa0CHAGsUPZ9oO19m7bZCPgZcF1ZdLztvdo9VsRMknyNiORsRP+00iI9r2n+gi4c93Hgk7YvlLQUcIGkU8cZhP4s21t04XgRM0XyNSKSsxF9MmUhbftwAElLAI/anl/mFwQW6eSgtm8Fbi2vH5B0BbASkKc5RUxD8jUikrMR/dPOzYanAYs1zC8G/Ga6AUiaA6wLnDvO6g0kXSzpl5LWmuD9O0uaJ2neHXfcMd1wIkbFQOZr2UdyNqJPco6N6K12CulFbT84NlNeLz6dg0taEvgpsKvt+5tWXwisansdYD/gf8fbh+0Dbc+1PXfWrFnTCSdilAxkvpZYkrMRfZBzbETvtVNIPyTpZWMzZUitRzo9cBlR4KfAkbaPb15v+/6xQsD2ycDCkpbv9HgRM0zyNWIGS85G9EfLw98BuwLHSrqF6vHCzwG27eSgkgQcDFxh+5sTbPMc4DbblrQeVdF/VyfHi5iBdiX5GjEjJWcj+qedcaTPl7QGsHpZdKXtf3R43FcB7wIulXRRWfY5YHY51gHA24APSXqcqiVtO9seZ18R0ST5GjGjJWcj+qSVcaQ3tn26pK2bVr2ojEv7tEtGU7F9NlUr2WTb7A/s3+6+I2ay5GtEJGcj+qeVFulXA6cDW46zzkDbJ+aI6Jnka0RERJ+0Mo70HuXnjr0PJyKmI/kaERHRP6107fjEZOsnupEhIvov+RoREdE/rXTtWGqSdbkxIWKwJF8jIiL6pJWuHV8CkHQ4sIvte8v8MsA+PY0uItqSfI2IiOifdh7IsvbYSRnA9j1Ujx2NiMGTfI2IiOixdgrpBUqrFgCSlqW9B7pERP8kXyMiInqsnRPrPsAfJB1b5rcBvtL9kCKiC5KvERERPdbOkw2PkDQP2Lgs2tr25b0JKyKmI/kaERHRe21d6i0n4pyMI4ZA8jUiIqK32ukjHRERERERRQrpiIiIiIgOpJCOiIiIiOhACumIiIiIiA6kkI6IiIiI6EBthbSkN0i6UtLVkj47zvpFJB1d1p8raU4NYUZEkZyNGB7J14j+qKWQlrQg8F1gM2BNYHtJazZtthNwj+0XAN8Cvt7fKCNiTHI2YngkXyP6p64W6fWAq21fa/vvwFHAVk3bbAUcXl4fB2wiSX2MMSKelJyNGB7J14g+qauQXgm4sWH+prJs3G1sPw7cByzXl+giollyNmJ4JF8j+qStJxsOIkk7AzuX2QclXVlnPDPc8sCddQcxLPSN93Rzd6t2c2e9lJwdGMnXNs3EnE2+DpTkbBv6la91FdI3A6s0zK9clo23zU2SFgKWBu5q3pHtA4EDexRntEHSPNtz644jeiI5O2KSryMt+TqCkrODqa6uHecDL5T0PEnPALYDTmza5kRg7OvE24DTbbuPMUbEk5KzEcMj+RrRJ7W0SNt+XNJHgV8DCwKH2L5M0l7APNsnAgcDP5R0NXA3
1R+CiKhBcjZieCRfI/pH+QIa3SJp53IZMCIGXPI1YrgkZwdTCumIiIiIiA7kEeERERERER1IIR0RERER0YEU0hERERERHUghHV0h6Yi6Y4iIiIjop6F/smH0n6Tm8UgFvEbSswBsv6nvQUVERyTtaPvQuuOIiBhGGbUj2ibpQuBy4AeAqQrpn1DGIbX9u/qii4h2SPqr7dl1xxERTyrn2eOBn9i+pu54YmJpkY5OzAV2AT4PfMr2RZIeSQEdMZgkXTLRKmCFfsYSES1ZBngW8FtJf6NqrDra9i21RhVPkxbp6JiklYFvAbcBb0qrVsRgknQb8HrgnuZVwDm2V+x/VBExEUkX2n5Zef1vwPbA1sAVVK3UeTDLgEiLdHTM9k3ANpLeCNxfdzwRMaFfAEvavqh5haQz+h5NRLTM9lnAWZI+BrwW2BZIIT0g0iIdERERMUAkHWV7u7rjiKll+LuIiIiIATJZES1px37GEpNLi3RERETEkMhIO4MlLdIxLknntLn9RpJ+0at4ImJiydeI0SLpkgmmS8lIOwMlNxvGuGy/su4YIqI1ydeIkbMCk4y00/9wYiJpkY5xSXqw/NxI0hmSjpP0f5KOlKSy7g1l2YVUw/KMvXcJSYdIOk/SnyRtVZbvK+mL5fXrJZ0pKf8HI6Yp+RoxcsZG2rmhaboeOKPe0KJRWqSjFesCawG3AL8HXiVpHnAQsDFwNXB0w/afB063/b7y2PDzJP0G2B04X9JZwHeAzW0/0b9fI2JGSL5GDDnbO02y7h39jCUml9aFaMV5tm8qJ9GLgDnAGsB1tq9ydcfqjxq2fx3wWUkXUX1zXhSYbfth4APAqcD+eexpRE8kXyMi+iQt0tGKxxpez2fq/zcC3mr7ynHWvQS4C8iT1CJ6I/kaEdEnaZGOTv0fMEfSamV++4Z1vwY+1tA3c93yc1Xgk1SXnjeTtH4f442YyZKvEQMkI+2MjhTS0RHbjwI7AyeVm5dub1i9N7AwcImky4C9y0n6YGA327cAOwE/kLRon0OPmHGSrxGDJSPtjI48kCUiIiKijyQ9aHtJSRsBewJ3Ai8GLgDeaduS3gB8G3gYOBt4vu0tJC0B7Fe2XxjY0/bPJO0L3GV7L0mvp7qReKPcJNxb6SMdERERUZ+MtDPE0rUjIiIioj4ZaWeIpUU6IiIioj4ZaWeIpUU6IiIiYrBkpJ0hkUI6IiIiYoBkpJ3hkVE7IiIiIiI6kBbpiIiIiIgOpJCOiIiIiOhACumIiIiIiA6kkI6IiIiI6EAK6YiIiIiIDqSQjoiIiIjoQArpiIiIiIgOpJCOiIiIiOhACumIiIiIiA6kkI6IiIiI6EAK6YiIiIiIDqSQjoiIiIjoQArpESHpvZLObph/UNLzp3jPHEmWtFDvI4yIiQxr/kraQdIpdR0/olWDnGOSXiXpqhLTm3t5rOi+FFAjyvaSdcfQLZLmANcBC9t+vOZwInpuWPLX9pHAkXXHEdGuAcuxvYD9be9bdyDNJO0JvMD2O+uOZVClRTpGQlrVI9qXvInorRZzbFXgsm7tX9KC091HtC6F9BCStIqk4yXdIekuSfuPs40lvaC8XkzSPpJukHSfpLMlLTbOe94q6XpJL57i+BtKOkfSvZJulPTesnxpSUeUuG6Q9AVJC5R1e0r6UcM+nnLJTNIZkvaW9HtJD0g6RdLyZfMzy897y6WvDcplut9L+paku4C9JN0t6SUNx3i2pIclzWrn843opTrztyHvdpL0V+D0svx9kq6QdI+kX0tateE9r5N0ZTn2/0j6naT3l3XNl8tfKen8su35kl7ZsG6yHI/ommHKMUnXAM8Hfl7Ob4uUc+nBkm6VdLOkL6sUx+Oc+/aUdJik70k6WdJDwGskrSjpp+UzuE7Sxxti3FPScZJ+JOl+4L0T/C5vAD4HbFtiu1jSNpIuaNruE5J+Vl4fJukASaeWPP9d09+TNcq6u8vflbdP9FkOixTSQ6Yk0y+AG4A5wErA
UVO87RvAy4FXAssCnwaeaNrvjsDXgU1t/3mS468K/BLYD5gFvBS4qKzeD1ia6o/Cq4F3Azu2+KsBvKNs/2zgGcBuZfm/l5/Psr2k7T+U+fWBa4EVgL2pPofGy0/bA6fZvqONGCJ6pu78bfBq4F+A10vaiupkuTVVTp8F/KTsd3ngOGB3YDngyhLHeL/bssBJwHfKtt8ETpK0XMNmE+V4RFcMW47ZXg34K7BlOb89BhwGPA68AFgXeB3w/oZ9N577vlKWvaO8Xgo4B/g5cHH5/TcBdpX0+oZ9bEWV289igu5Ztn8FfBU4usS2DnAi8DxJ/9Kw6buAIxrmd6A6Jy9PVR8cCSBpCeBU4MdUfwO2A/5H0prjfoLDwnamIZqADYA7gIWalr8XOLth3lRJuADwCLDOOPuaU7bbDbgcWLmF4+8OnDDO8gWBvwNrNiz7D+CM8npP4EfjHHuhMn8G8IWG9R8GfjXetg2/71+bYlif6g+Syvw84O11/5tlyjQ2DUD+jr3n+Q3Lfgns1DC/APAw1eXmdwN/aFgn4Ebg/c1xU51Mz2s63h+A95bXE+Z4pkzdmoYtx8r89VQFOlTF8WPAYg3bbw/8tuH3aD73HQYc0TC//jjb7A4cWl7vCZzZ4ue5Jw3n7rLse8BXyuu1gHuARRpiOaph2yWB+cAqwLbAWU37+j6wR93/b6YzpUV6+KwC3ODWb7pbHlgUuGaSbT4FfNf2TS0ef7x9LQ8sTNUKMOYGqm/Drfpbw+uHqRJwMjc2ztg+t7xvI0lrUP2RPLGN40f0Wt35O6Yxd1YF9lXVVete4G6qgnklYMXGbV2d+SY6zoo8Nf/h6X8D2s3xiHYNW441W5XqXHprw/bfp2rBHW/fEx1vxbH3l318jqpIn2wfrToceIckUX2BPsZVS/rT9m37Qarfd8US1/pNce0APGcasdQuHcyHz43AbEkLtfiH4k7gUWA1qss843kd8CtJf7P90xaOv94Ex/kHVaJcXpbNBm4urx8CFm/Yvp3EcRvLD6fq3vE34Djbj7ZxnIheqzt/xzTmzo1UrUtPu7wr6YXAyg3zapxvcgtV/jeaDfyqxZgiumGocmwcN1K1SC8/Sfzjnfuaj3ed7Re2GN9knrad7T9K+jvwb1RdSt7RtMkqYy8kLUnVXeaWEtfvbL+2xWMPhbRID5/zgFuBr0laQtKikl410ca2nwAOAb5Zbj5YUNXNeos0bHYZ8Abgu5LeNMXxjwQ2lfR2SQtJWk7SS23PB44BviJpqdKX+hPA2A2GFwH/Lmm2pKWpLjO16g6q/mqTjvlZ/Ah4C1UxfcQU20b0W935O54DgN0lrQX/vGl4m7LuJOAlkt6s6sbgjzDxl+CTgRdJekf527AtsCZVf9WIfhm2HGuO51bgFGAfSc+UtICk1SS9uo3jnQc8IOkzqm6kXFDSiyX9awex3wbMURk4oMERwP7AP2yf3bRuc1WDEjyDqq/0H23fSPW34EWS3iVp4TL9a1N/66GTQnrIlIJ1S6puC3+lusy67RRv2w24FDif6hLL12n6t7d9MbAFcJCkzSY5/l+BzYFPln1dBKxTVn+MquX5WuBsqhsKDinvOxU4GrgEuIA2Tq62H6a6ieL35XLQKybZ9kbgQqpv0We1eoyIfqg7fyeI6YSyz6NU3cH/Z2Czsu5OYBvgv4C7qArjeVQtZs37uavE8Mmy7aeBLco+Ivpi2HJsAu+muhn3cqr+x8cBz23jePNLrC+legbDncAPqAYDaNex5eddki5sWP5D4MU82VjW6MfAHlSf5cspgwDYfoCqdX87qhbqv1F9LouMs4+hMXZTVsTIkHQIcIvtL9QdS8QoKa1SNwE72P5t3fFERD1UDQ94O/Ay21c1LD8MuGkmnX/TRzpGiqqnIG5NNWRQRExTGTLrXKqRDT5FdZPUH2sNKiLq9iHg/MYieqZK1454Gkk7qBp8vXnq6MlL/SJpb6pLZv9t+7q644moQw/ydwOqEQ3u
pLpk/mbbj3Qt4IghM6znyIlI+uUEv8/nJtj+emAXqm5cM166dkREREREdCAt0hERERERHRipPtLLL7+858yZU3cYEX13wQUX3Gl7Vt1xtCs5GzPVMOZs8jVmqsnydaQK6Tlz5jBv3ry6w4joO0nNT5QbCsnZmKmGMWeTrzFTTZav6doRERExA0n6f5Iuk/RnST+RtGjdMUUMmxTSERERM4yklYCPA3NtvxhYkOpBGRHRhhTSERERM9NCwGLl8e+LUz1tLiLaMFJ9pLvp5Z86ou4QYsRd8N/vrjuEkZKcjV4bpZy1fbOkb1A9RvsR4BTbpzRvJ2lnYGeA2bNnt7Tv5GJ7Run/1UyUFumIiIgZRtIywFbA84AVgSUkvbN5O9sH2p5re+6sWUM1yEhEX6SQjoiImHk2Ba6zfYftfwDHA6+sOaaIoZNCOiIiYub5K/AKSYtLErAJcEXNMUUMnRTSERERM4ztc4HjgAuBS6nqgQNrDSpiCOVmw4iIiBnI9h7AHnXHETHM0iIdEREREdGBnhbSkg6RdLukPzcsW1bSqZKuKj+XmeC97ynbXCXpPb2MMyIqydmIiIjW9bpF+jDgDU3LPgucZvuFwGll/ikkLUt1uWl9YD1gj4lO3hHRVYeRnI2IiGhJTwtp22cCdzct3go4vLw+HHjzOG99PXCq7btt3wOcytNP7hHRZcnZiOEj6UWSThu7kiRpbUlfqDuuiJmgjj7SK9i+tbz+G7DCONusBNzYMH9TWfY0knaWNE/SvDvuuKO7kUYEJGcjBt1BwO7APwBsXwJsV2tEETNErTcb2jbgae4jT12K6JPkbMRAWtz2eU3LHq8lkogZpo5C+jZJzwUoP28fZ5ubgVUa5lcuyyKi/5KzEYPtTkmrUb7kSnobcOvkb4mIbqijkD4RGLuj/z3Az8bZ5tfA6yQtU25Yel1ZFhH9l5yNGGwfAb4PrCHpZmBX4IO1RhQxQ/R6+LufAH8AVpd0k6SdgK8Br5V0FbBpmUfSXEk/ALB9N7A3cH6Z9irLIqKHkrMRw8f2tbY3BWYBa9je0PYNdccVMRP09MmGtrefYNUm42w7D3h/w/whwCE9Ci0ixpGcjRg+kpajGn5yQ8CSzqb6MntXvZFFjL482TAiImK4HQXcAbwVeFt5fXStEUXMED1tkY6IiIiee67tvRvmvyxp29qiiZhB0iIdEREx3E6RtJ2kBcr0dnKzb0RfpJCOiIgYbh8Afgw8VqajgP+Q9ICk+2uNLGLEpWtHRETEELO9VN0xRMxUaZGOiIgYYpJ+KmlzSTmnR/RZy0kn6SW9DCQiIiI68j1gB+AqSV+TtHrdAUXMFO18e/0fSedJ+rCkpXsWUURERLTM9m9s7wC8DLge+I2kcyTtKGnheqOLGG0tF9K2/43qG+8qwAWSfizptT2LLCIiIlpSHsryXqqHJP0J2JeqsD51kvc8S9Jxkv5P0hWSNuhLsBEjpK2bDW1fJekLwDzgO8C6kgR8zvbxvQgwIjojaSVgVRry3PaZ9UUUEb0g6QRgdeCHwJa2by2rjpY0b5K37gv8yvbbJD0DWLzHoUaMnJYLaUlrAzsCb6T6hrul7QslrQj8AUghHTEgJH0d2Ba4HJhfFhtIIR0xeg6yfXLjAkmL2H7M9tzx3lC6aP47VSs2tv8O/L3XgUaMmnZapPcDfkDV+vzI2ELbt5RW6ogYHG8GVrf9WN2BRETPfRk4uWnZH6i6dkzkeVSPEj9U0jrABcAuth/qTYgRo6mlPtKSFgRutv3DxiJ6jO0fdj2yiJiOa4HcZBQxwiQ9R9LLgcUkrSvpZWXaiKm7aSxEVWh/z/a6wEPAZ8c5xs6S5kmad8cdd3T5N4gYfi21SNueL2kVSc8ol38iYrA9DFwk6TSqJ50BYPvj9YUUEV32eqquGSsD+wAqy+8HPjfFe28CbrJ9bpk/jnEKadsHAgcCzJ0719MPOWK0tNO14zrg95JOpPrmCoDtb3Y9qoiYrhPL
FBEjyvbhwOGS3mr7pxNtJ+k9ZdvG9/5N0o2SVrd9JbAJ1T0VEdGGdgrpa8q0ADD2ONJ8O40YQLYPL3fhv6gsutL2P+qMKSJ6Y7IiutgFOHyc5R8Djix/K66lGlAgItrQTiF9ue1jGxdI2qaTg5anLh3dsOj5wBdtf7thm42An1G1hAMcb3uvTo4XMdOU/Dmc6uEMAlYprVJtj9qRfI0Yehpvoe2LgHFH9YiI1rRTSO8OHNvCsimVy0gvhSdvZAROGGfTs2xv0e7+I4J9gNeVXEPSi4CfAC9vd0fJ14ihl6vHET0yZSEtaTNgc2AlSd9pWPVM4PEuxLAJcI3tG7qwr4ioLDxWRAPY/kuXHhWcfI0YPuO2SEfE9LUy/N0tVE8yfJRqnMmx6USqO4anazuqlrLxbCDpYkm/lLTWeBtkaJ6Icc2T9ANJG5XpIKo8nq5p5SskZyO6SdICkt4+xWa/70swETPQlC3Sti8GLpb0427frFRucHgTVReRZhcCq9p+UNLmwP8CLxwnvgzNE/F0HwI+AowNd3cW8D/T2WE38hWSsxHdZPsJSZ8Gjplkm4/2MaSIGaWlB7IU60k6VdJfJF0r6TpJ107z+JsBF9q+rXmF7fttP1henwwsLGn5aR4vYkYojwb+pu2ty/StLjzlMPkaMZh+I2m38ryHZcemuoOKmAnaudnwYOD/UXXrmN+l42/PBJeJJT0HuM22Ja1HVfTf1aXjRowkScfYfrukSxnnBiPba09j98nXiMG0bfn5kYZlphphJyJ6qJ1C+j7bv+zWgSUtAbwW+I+GZR8EsH0A8DbgQ5IeBx4BtrOdy8ARk9ul/Ozq6BnJ14jBZft5dccQMVO1U0j/VtJ/A8fz1EcOX9jJgW0/BCzXtOyAhtf7A/t3su+Imcr2reVnV0fVSL5GDC5JiwOfAGbb3lnSC4HVbf+i5tAiRl47hfT65Wfj4O0GNu5eOBHRDZIe4OldO+6jGrnjk7ane39DRAyOQ6m6Xb6yzN9M9YyHFNIRPdZyIW37Nb0MJCK66tvATcCPqcaQ3Q5YjWp0jUOAjeoKLCK6bjXb20raHsD2w5IydnREH7RcSEv64njL8xjgiIH0JtvrNMwfKOki25+R9LnaooqIXvi7pMUoV6EkrUZDF8yI6J12hr97qGGaTzUU1pwexBQR0/ewpLeXhzWMPbDh0bIuNwFGjJY9gF8Bq0g6EjgN+HS9IUXMDO107dincV7SN4Bfdz2iiOiGHYB9qR7CYuCPwDtLq1UezhAxIiQtACwDbA28gqor1y6276w1sIgZop2bDZstDqzcrUAionvKzYRbTrD67H7GEhG9M/ZkQ9vHACfVHU/ETNNy1w5Jl0q6pEyXAVdS3dAUEQNG0osknSbpz2V+bUlfqDuuiOiJPNkwoibttEg3PuDhcaqnmD3e5XgiojsOAj4FfB/A9iWSfgx8udaoIqIX8mTDiJq000f6BknrAP9WFp0JXNKTqCJiuha3fV7TCFj54hsxYkof6c/aPrruWCJmona6duwCHAk8u0xHSvpYrwKLiGm5swyBNTYc1tuAW+sNKSK6zfYTVFefIqIG7XTt2AlYvzwqGElfB/4A7NeLwCJiWj4CHAisIelm4DqqkTwiYvT8RtJuwNFUQ9QCYPvu+kKKmBnaKaRFNX70mPllWUQMEEkLAh+2vamkJYAFbD9Qd1wR0TMd95Eufy/mATfb3mKq7SPiqdoppA8FzpV0Qpl/M3Bw1yOKiGmxPV/ShuX1Q1NtHxHDzfbzpvH2XYArgGd2KZyIGaWdmw2/KekMYMOyaEfbf+pJVBExXX+SdCJwLE+91Ht8fSFFRC9IWhz4BDDb9s6SXgisbvsXU7xvZeCNwFfK+yOiTS0X0pJeAVxm+8Iy/0xJ69s+t2fRRUSnFgXuAjZuWGYghXTE6DkUuAB4ZZm/mepL9KSFNNWzID4NLDXRBpJ2BnYGmD179nTjjBg57XTt+B7wsob5B8dZFhEDwPaO
k62XtLvt/+xXPBHRU6vZ3lbS9gC2H1bT2JfNJG0B3G77AkkbTbSd7QOpblxm7ty57l7IEaOh5eHvANn+ZxKVIXc6fsS4pOvL0xIvkjRvnPWS9B1JV5enKaZgj+iebdp9Q3I2YmD9XdJiPDnc5WrAY1O851XAmyRdDxwFbCzpRz2NMmIEtVMIXyvp41St0AAfBq6d5vFfY/vOCdZtBrywTOuX464/zeNFRKXTEXeSsxGDZw/gV8Aqko6kKpLfO9kbbO8O7A5QWqR3s/3OnkYZMYLaaZH+IFX/q5uBm6hOkDv3IqhiK+AIV/4IPEvSc3t4vIiZpBeXaJOzEX0k6VXl5ZnA1lTF80+AubbPqCmsiBml5ULa9u22t7P9bNsr2H6H7dvH1kvavc1jGzhF0gXlZoZmKwE3NszfVJY9haSdJc2TNO+OO+5oM4SIGauTFunkbMRg+U75+Qfbd9k+yfYvJrlqNC7bZ2QM6YjOdNzHeRzbAO3cvLSh7ZslPRs4VdL/2T6z3YPmRoiIp5O0bPNTzSQ9z/Z1ZfbYDnabnI0YLP+QdCCwsqTvNK+0/fEaYoqYUdrp2jGVtlq4bN9cft4OnACs17TJzcAqDfMrl2URMbWfS/rnAxYkrQn8fGze9lfb3WFyNmLgbAGcDjxCNfxd8xQRPdbNFumWW5YaH1tcXr8O2KtpsxOBj0o6iqo/9n22b+1atBGj7atUxfQbgdWBI4AdOt1ZcjZi8JQuHEdJusL2xXXHEzETdbOQbqdFegXghDLM5ULAj23/StIHAWwfAJwMbA5cDTwMTDoubkQ8yfZJkhYGTqF62MJbbP9lGrtMzkYMrkcknQasYPvFktYG3mT7y3UHFjHq2nmyYdf6XNq+FlhnnOUHNLw28JFW9xkRIGk/nnp1aGngGqqW4o77TCZnIwbaQcCngO8D2L5E0o+BFNIRPdZOi/TPJW1m+374Z5/LY4AXQ2d9LiOi65oflJJ+khGjb3Hb5zU9zPDxuoKJmEnaKaS72ucyIrrP9uHwzz7Nj9qeX+YXBBapM7aI6Jk7y9MMx55s+DYg9ydE9EHLhXQP+lxGRO+cBmwKPFjmF6PK3VfWFlFE9MpHqIaUXEPSzcB1pKFrRvrrXi+pO4ShM/uLl07r/VMW0r3qcxkRPbWo7bEiGtsPSlq8zoAiovvK1aYP2960cXSduuOKmClaaZFOn8uI4fOQpJfZvhBA0supxpqNiBFie76kDcvrh+qOJ2KmmbKQTp/LiKG0K3CspFuohqZ8DrBtrRFFRK/8SdKJVKNn/bOYtn18fSFFzAzt3GyYPpcRQ8L2+ZLWoLoxGOBK2/+oM6aI6JlFgbuAjRuWGUghHdFj7RTS6XMZMeAkbWz7dElbN616UbmnISfWiBFje9KHH0na3fZ/9iueiJmknUI6fS4jBt+rgdOBLcdZlxaqiJlpGyCFdEQPtFNI70r6XEYMNNt7lJ95PHdEjNHUm0REJ9oZRzp9LiMGnKRPTLbe9jf7FUtEDAxPvUlEdKKVcaTT5zJieCw1ybqcTCNmprRIR/RIKy3S6XMZMSRsfwlA0uHALrbvLfPLAPvUGFpE9IikZW3f3bTsebavK7PH1hBWxIzQyjjS6XMZMXzWHiuiAWzfI2ndGuOJiN75uaTNbN8PIGlN4BjgxQC2v9r8BkmrAEcAK1A1ih1oe9/+hRwxGlrp2pE+lxHDZwFJy9i+B6oWK9q7uTgihsdXqYrpN1Ldx3QEsMMU73kc+KTtCyUtBVwg6VTbl/c41oiR0sqJtet9Llv5JixpI+BnwNilqeNt79XJ8SJmoH2AP0gau6S7DfCVTnaUfI0YbLZPkrQw1UPSlgLeYvsvU7znVuDW8voBSVcAKwEppCPa0ErXjl70uWz1m/BZtrfo8BgRM5btIyTN48knnW09jZam5GvEAJK0H09t0FoauAb4aBkM4OMt7mcOsC5w7jjrdgZ2Bpg9e/Z0Q44YOe1c6u1an8t8E47ovVLoTjun
kq8RA2te0/wF7e5A0pLAT4Fdx/pYN7J9IHAgwNy5czPyT0STdgrpnvS5nOybMLCBpIuBW4DdbF82zvvzbTmiT6abr2UfydmILrB9OICkJYBHbc8v8wsCi0z1/tId5KfAkRnKNqIzC7Sx7Vify70l7Q2cA/zXdA4+xTfhC4FVba8D7Af873j7sH2g7bm2586aNWs64UTEJLqRr5CcjeiB04DFGuYXA34z2RskCTgYuCKDBkR0ruVC2vYRwNbAbWXa2vYPOz3wVN+Ebd9v+8Hy+mRgYUnLd3q8iOhc8jVioC06ln8A5fXiU7znVcC7gI0lXVSmzXsZZMQoaqtrRrf6XLbyTVjSc4DbbFvSelRF/13TPXZEtCf5GjHwHpL0MtsXAkh6OfDIZG+wfTZ54mHEtNU1ruzYN+FLJV1Uln0OmA1g+wDgbcCHJD1O9QdhO9u50SGi/5KvEYNtV+BYSbdQFcfPAbatNaKIGaKWQrqVb8K29wf2709EETGR5GvEYLN9vqQ1qB7GAnCl7X/UGVPETJEnnUVERAwhSRvbPl3S1k2rXlTGkc5IHBE9lkI6IiJiOL0aOB3Ycpx1BlJIR/RYCumIiIghZHuP8nPHumOJmKlSSEdERAwhSZ+YbH3Gh47ovRTSERERw2mpSdZl1JyIPkghHRERMYRsfwlA0uHALrbvLfPLUD2NOCJ6rJ1HhEdERMTgWXusiAawfQ+wbn3hRMwcKaQjIiKG2wKlFRoAScuSK84RfZFEi4iIGG77AH+QdGyZ3wb4So3xRMwYKaQjIiKGmO0jJM0DNi6LtrZ9eZ0xRcwUKaQjIiKGXCmcUzxH9Fn6SEdEREREdCCFdEREREREB1JIR0RERER0IIV0REREREQHUkhHRERERHSgtkJa0hskXSnpakmfHWf9IpKOLuvPlTSnhjAjokjORoyWqXI6IqZWSyEtaUHgu8BmwJrA9pLWbNpsJ+Ae2y8AvgV8vb9RRsSY5GzEaGkxpyNiCnW1SK8HXG37Wtt/B44CtmraZivg8PL6OGATSepjjBHxpORsxGhpJacjYgp1PZBlJeDGhvmbgPUn2sb245LuA5YD7mzcSNLOwM5l9kFJV/Yk4mjF8jT9+8TE9I33dHN3q3ZzZ+NIzo6e5Gubhixnp9JKTo9avg7k//ku/78aVAP52QOwR0vtPRPm69A/2dD2gcCBdccRIGme7bl1xxGDLTk7GJKv0YpRytf8n6/PKH/2dXXtuBlYpWF+5bJs3G0kLQQsDdzVl+giollyNmK0tJLTETGFugrp84EXSnqepGcA2wEnNm1zIjB2veNtwOm23ccYI+JJydmI0dJKTkfEFGrp2lH6T34U+DWwIHCI7csk7QXMs30icDDwQ0lXA3dTJXkMtpG4/BdPl5wdScnXGWyinK45rF7L//n6jOxnrzQYRURERES0L082jIiIiIjoQArpiIiIiIgOpJCOiIiIiOhACumIiIiI6BpJa0jaRNKSTcvfUFdMvZJCOrpO0o51xxARETGenKN6S9LHgZ8BHwP+LKnx0fNfrSeq3smoHdF1kv5qe3bdcURERDTLOaq3JF0KbGD7QUlzgOOAH9reV9KfbK9bb4TdNfSPCI96SLpkolXACv2MJSImJ+k5wB7AE8AXqVqK3gpcAexi+9Yaw4voupyjarWA7QcBbF8vaSPgOEmrUn3+IyWFdHRqBeD1wD1NywWc0/9wImIShwEnAUsAvwWOBDYH3gwcAGw10RsjhlTOUfW5TdJLbV8EUFqmtwAOAV5Sa2Q9kEI6OvULYMmxRGkk6Yy+RxMRk1nB9n4Akj5s++tl+X6SdqoxroheyTmqPu8GHm9cYPtx4N2Svl9PSL2TPtIRESNO0sW21ymvv2z7Cw3rLrU9cq1EERH9kFE7IiJG38/GhqFqKqJfAFxZW1QREUMuLdIRETOYpB1tH1p3HBERwygt0jEuSW3djCFpI0m/6FU8EdEzX6o7gIiZJufY0ZGbDWNctl9Z
dwwR0R0ZCixisOQcOzrSIh3jkvRg+bmRpDMkHSfp/yQdKUll3RvKsguBrRveu4SkQySdJ+lPY081krSvpC+W16+XdKak/B+M6L0VqO6k33Kc6a4a44qYkXKOHR1pkY5WrAusBdwC/B54laR5wEHAxsDVwNEN238eON32+yQ9CzhP0m+A3YHzJZ0FfAfY3PYT/fs1ImasDAUWMbhyjh1i+aYSrTjP9k0lIS8C5gBrANfZvsrVHas/atj+dcBnJV0EnAEsCsy2/TDwAeBUYH/b1/TtN4iYwWzvZPvsCda9o9/xRMRT5Bw7xNIiHa14rOH1fKb+fyPgrbbHG1brJVSXklfsUmwRERHDLOfYIZYW6ejU/wFzJK1W5rdvWPdr4GMN/bzWLT9XBT5JdRlrM0nr9zHeiIiIYZFz7JBIIR0dsf0osDNwUrkR4vaG1XsDCwOXSLoM2Lsk/MHAbrZvAXYCfiBp0T6HHjFyMpRWxGjJOXZ45IEsEREzjKSNqE64W9QcSkTEUEuLdETEkMtQWhER9cjNhhERoyVDaUVE9ElaFyIiRkuG0oqI6JO0SEdEjJYMpRUR0SdpkY6IGH0ZSisiogdSSEdEjLgMpRUR0RsZ/i4iIiIiogNpkY6IiIiI6EAK6YiIiIiIDqSQjoiIiIjoQArpiIiIiIgOpJCOiIiIiOhACumIiIiIiA6kkI6IiIiI6MD/D+6OH9D/ve96AAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Check whether users' click environments vary noticeably: randomly sample 10 users and plot the distribution of their click environments\n",
+ "sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)\n",
+ "sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]\n",
+ "cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']\n",
+ "for _, user_df in sample_users.groupby('user_id'):\n",
+ " plot_envs(user_df, cols, 2, 3)"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Check whether users' click environments vary noticeably: randomly sample 10 users and plot the distribution of their click environments\n",
- "sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)\n",
- "sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]\n",
- "cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']\n",
- "for _, user_df in sample_users.groupby('user_id'):\n",
- " plot_envs(user_df, cols, 2, 3)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
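- "The plots show that the click environment of the vast majority of users is fairly stable. Idea: statistical features of these environment fields can be used to represent the user's own attributes."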
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Distribution of the number of news articles clicked per user"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:40:04.296033Z",
- "start_time": "2020-11-13T15:40:03.980868Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- "[]"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
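+ "The plots show that the click environment of the vast majority of users is fairly stable. Idea: statistical features of these environment fields can be used to represent the user's own attributes."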
]
- },
- "execution_count": 33,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Distribution of the number of news articles clicked per user"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)\n",
- "plt.plot(user_click_item_count)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "A user's activity level can be inferred from how many articles they clicked."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 34,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Top 50 users by click count\n",
- "plt.plot(user_click_item_count[:50])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
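- "The top 50 users by click count each have more than 100 clicks. Idea: as a simple heuristic, users with at least 100 clicks can be defined as active users. A more complete measure of activity also takes click time into account; later we judge user activity from both click count and click time."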
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Users whose click-count rank falls in [25000:50000]\n",
- "plt.plot(user_click_item_count[25000:50000])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "The plot shows that a large number of users clicked at most twice; these users can be regarded as inactive."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### News click count analysis"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:42:14.526476Z",
- "start_time": "2020-11-13T15:42:14.463642Z"
- }
- },
- "outputs": [],
- "source": [
- "item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:42:16.198000Z",
- "start_time": "2020-11-13T15:42:16.044455Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- "[]"
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:40:04.296033Z",
+ "start_time": "2020-11-13T15:40:03.980868Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAASw0lEQVR4nO3da4yc1X3H8e8fr+/GVxZj1nZsgnshVCl0RYyS8iLkBm1qKpGIqCpWimSpJU1SWjXQvEjUV0nUQEMTkTghFamilIRQYVW0gQJRlRdxsgbCNYSNa8CLsZeLL/EFbHz6Yo6dsbPjZ9be2Znn+PuRrH2e85yZ55x9xr+ZOXP2TKSUkCSV64xuN0CS1FkGvSQVzqCXpMIZ9JJUOINekgrX1+0GAJx11llpxYoV3W6GJNXKpk2bXk4p9VfV64mgX7FiBUNDQ91uhiTVSkQ81049h24kqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSpcrYP+p1te5eb7nuGNQ4e73RRJ6lm1DvqHn3uNWx8c5tBhg16SWql10EuSqhn0klQ4g16SCmfQS1Lhigh6v99cklqrddBHdLsFktT7ah30kqRqBr0kFc6gl6TCGfSSVLgigt5JN5LUWq2DPnDajSRVqXXQS5KqGfSSVDiDXpIKV0TQJ9dAkKSWah30LoEgSdVqHfSSpGoGvSQVzqCXpMIZ9JJUuCKC3jk3ktRaEUEvSWrNoJekwhn0klS4toI+Iv4mIp6MiCci4jsRMSMiVkbExogYjog7I2Jarjs97w/n4ys62gNJ0glVBn1EDAAfBwZTShcCU4BrgM8Dt6SUzgdeA67LN7kOeC2X35LrSZK6pN2hmz5gZkT0AbOAbcC7gbvy8TuAq/L2mrxPPn55RGcXK3CpG0lqrTLoU0ojwD8Bz9MI+F3AJmBnSulQrrYVGMjbA8AL+baHcv1Fx99vRKyLiKGIGBodHT2pxnf4+UOSitDO0M0CGq/SVwLnArOBD5zqiVNK61NKgymlwf7+/lO9O0lSC+0M3bwH+L+U0mhK6SBwN/BOYH4eygFYCozk7RFgGUA+Pg94ZUJbLUlqWztB/zywOiJm5bH2y4GngIeAq3OdtcA9eXtD3icffzC5YLwkdU07Y/QbaXyo+jDweL7NeuBTwA0RMUxjDP72fJPbgUW5/Abgxg60W5LUpr7qKpBS+gzwmeOKNwOXjFH3APChU2/aOPh+QZJaqvVfxjrnRpKq1TroJUnVDHpJKpxBL0mFM+glqXBFBH1y2o0ktVTroHepG0mqVuuglyRVM+glqXAGvSQVzqCXpMIVEfSujSlJrdU66J10I0nVah30kqRqBr0kFc6gl6TCFRH0fhYrSa3VOujDNRAkqVKtg16SVM2gl6TCGfSSVDiDXpIKV0TQJ9dAkKSWah30TrqRpGq1DnpJUjWDXpIKZ9BLUuEMekkqXBFB75wbSWqt1kHvpBtJqlbroJckVTPoJalwBr0kFa6toI+I+RFxV0T8PCKejohLI2JhRNwfEc/mnwty3YiIWyNiOCIei4iLO9sFSdKJtPuK/kvAf6eUfgd4O/A0cCPwQEppFfBA3ge4AliV/60DbpvQFo/BpW4kqbXKoI+IecBlwO0AKaU3Uko7gTXAHbnaHcBVeXsN8K3U8GNgfkQsmeB2H2lcR+5WkkrSziv6lcAo8K8R8UhEfCMiZgOLU0rbcp2XgMV5ewB4oen2W3OZJKkL2gn6PuBi4LaU0kXAXn49TANAaqwTPK4BlIhYFxFDETE0Ojo6nptKksahnaDfCmxNKW3M+3fRCP7tR4Zk8s8d+fgIsKzp9ktz2TFSSutTSoMppcH+/v6Tbb8kqUJl0KeUXgJeiIjfzkWXA08BG4C1uWwtcE/e3gBcm2ffrAZ2NQ3xSJImWV+b9f4a+HZETAM2Ax+l8STx3Yi4DngO+HCuey9wJTAM7Mt1Oyq52o0ktdRW0KeUHgUGxzh0
+Rh1E3D9qTWrPc65kaRq/mWsJBXOoJekwhn0klQ4g16SCldG0DvpRpJaqnXQu9SNJFWrddBLkqoZ9JJUOINekgpXRND7WawktVbroA8XQZCkSrUOeklSNYNekgpn0EtS4Qx6SSpcEUGfnHYjSS3VOuhdAkGSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQR9crUbSWqp1kHvpBtJqlbroJckVTPoJalwBr0kFc6gl6TCFRH0rnUjSa3VOuhd60aSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQS9k24kqbW2gz4ipkTEIxHxn3l/ZURsjIjhiLgzIqbl8ul5fzgfX9GhthOudiNJlcbziv4TwNNN+58HbkkpnQ+8BlyXy68DXsvlt+R6kqQuaSvoI2Ip8EfAN/J+AO8G7spV7gCuyttr8j75+OW5viSpC9p9Rf/PwN8Dh/P+ImBnSulQ3t8KDOTtAeAFgHx8V65/jIhYFxFDETE0Ojp6cq2XJFWqDPqI+GNgR0pp00SeOKW0PqU0mFIa7O/vn8i7liQ16WujzjuBP4mIK4EZwFzgS8D8iOjLr9qXAiO5/giwDNgaEX3APOCVCW95k+RiN5LUUuUr+pTSTSmlpSmlFcA1wIMppT8DHgKuztXWAvfk7Q15n3z8wdSpJHbkX5Iqnco8+k8BN0TEMI0x+Ntz+e3Aolx+A3DjqTVRknQq2hm6OSql9EPgh3l7M3DJGHUOAB+agLZJkiZAEX8ZK0lqrYig97NYSWqt1kHvZ7GSVK3WQS9JqmbQS1LhDHpJKpxBL0mFM+glqXC1DnpXP5akarUOeklSNYNekgpn0EtS4Qx6SSpcEUHvWjeS1Fqtg945N5JUrdZBL0mqZtBLUuEMekkqnEEvSYUrIugTTruRpFZqHfQudSNJ1Wod9JKkaga9JBXOoJekwhn0klS4IoLetW4kqbVaB72zbiSpWq2DXpJUzaCXpMIZ9JJUOINekgpXRNA76UaSWqt10IffMSVJlSqDPiKWRcRDEfFURDwZEZ/I5Qsj4v6IeDb/XJDLIyJujYjhiHgsIi7udCckSa2184r+EPC3KaULgNXA9RFxAXAj8EBKaRXwQN4HuAJYlf+tA26b8FZLktpWGfQppW0ppYfz9h7gaWAAWAPckavdAVyVt9cA30oNPwbmR8SSiW64JKk94xqjj4gVwEXARmBxSmlbPvQSsDhvDwAvNN1say47/r7WRcRQRAyNjo6Ot93HSK6BIEkttR30ETEH+D7wyZTS7uZjqZG040rblNL6lNJgSmmwv79/PDdtatNJ3UySTittBX1ETKUR8t9OKd2di7cfGZLJP3fk8hFgWdPNl+YySVIXtDPrJoDbgadTSjc3HdoArM3ba4F7msqvzbNvVgO7moZ4JEmTrK+NOu8E/hx4PCIezWX/AHwO+G5EXAc8B3w4H7sXuBIYBvYBH53IBkuSxqcy6FNKP4KWf5l0+Rj1E3D9KbZLkjRBav2XsUc450aSWisi6CVJrRn0klQ4g16SCmfQS1LhDHpJKlwRQe9SN5LUWq2DPlzsRpIq1TroJUnVDHpJKpxBL0mFM+glqXC1Dvq9rx8C4I1Dh7vcEknqXbUO+gWzpgJw2PmVktRSrYN+xtQpALzuK3pJaqnWQT+9rxH0Dt1IUmv1Dvqpjea//KvXu9wSSepdtQ76aVMazT/Dv5CVpJZqHfQLZ08DYMsre7vcEknqXbUO+vl51s3BNx2jl6RWah30s6Y1vtv82R2/6nJLJKl31TroARbNnsbTL+7udjMkqWfVPujPmTeDzS87Ri9JrdQ+6N+xchEAv9i+p8stkaTeVPug/+DblwDw9f/d3OWWSFJvqn3QX7R8AQDf27T16CJnkqRfq33QA/zjmrcB8MEv/6jLLZGk3lNE0F976QrOP3sOm0f3ctkXHmLPgYPdbpIk9Ywigh7g3o//IcsXzuL5V/fxe5+9jy/e94yLnUkSEKkH1nIfHBxMQ0NDE3JfN9//
C2594Nmj++/53bN5/9vO4f0XnsPcGVMn5ByS1AsiYlNKabCyXmlBD7DvjUP8y4PD3P3wVrbv/vXKlufMncFFy+dz4cA8Llo2n7cNzGPO9D6mnOGiaJLq57QO+mbbdx9gw6Mv8pMtr/LY1p3HBP8R5/XPZmD+TM6aM53zz57DnOl9rDxrNnNnTmXZgpksmjO9I22TpFNh0Ldw8M3DPDGyi0ee38nW1/azY88Bntq2m9cPHmZk5/4T3nbx3OmcM3cG0Fg5c/nCWUePLVs46+hqmkDjiePMY58gBubPPPqtWJJ0qtoN+r4OnfwDwJeAKcA3Ukqf68R5TsbUKWdw0fIFR+ffN9v/xpvsP/gmW17Zy659B9m26wDbdx8AGn95u//gmwBseXkvz726j0de2ElKsGt/+7N8zpz+m7/y/jOnc+78mWPWnzltCr+1eM4J73PZgmOfZE7kwoF59I1jqGp63xTmzfKzDanOJjzoI2IK8BXgvcBW4KcRsSGl9NREn2uizZw2hZnTprQdmkfs2neQnfvfOLo/snM/o3uOHSJ6/pV9vLbvN58Qhkd/xd7XDx19Emm2bed+dux5nYd+vqPluQ8d7vw7sgWzpk7IO5HxPCFVmTV9CqvOPnNC7msiNJ6sZ0zqOd/aP4czZ3TktZpOwYy+KZzRY5/7deJRcgkwnFLaDBAR/w6sAXo+6E/WvFlTj3nV+5ZFsyft3Dt2H2DHnva+SvGpF3dz8HD7U04PvZl4YmQXE/EFXk+M7GbX/oPjevfTyos797PHv4JWj4qA8/tP/C682ccvX8UH335uB1vUmaAfAF5o2t8KvOP4ShGxDlgHsHz58g404/Rw9twZnD23vVeSFw7M63BrJs+BMd4Bdcure9/g+Vf3Teo5n3tlLzvHeIeo7npx535Gx/kd1vNmdn5otGvv+1JK64H10PgwtlvtUD310ofa586f2fIzlk5Zfd6iST2f6q0Tfxk7Aixr2l+ayyRJXdCJoP8psCoiVkbENOAaYEMHziNJasOED92klA5FxMeAH9CYXvnNlNKTE30eSVJ7OjJGn1K6F7i3E/ctSRqfYlavlCSNzaCXpMIZ9JJUOINekgrXE6tXRsQo8NxJ3vws4OUJbE4d2OfTg30+PZxKn9+SUuqvqtQTQX8qImKonWU6S2KfTw/2+fQwGX126EaSCmfQS1LhSgj69d1uQBfY59ODfT49dLzPtR+jlySdWAmv6CVJJ2DQS1Lhah30EfGBiHgmIoYj4sZut2e8ImJLRDweEY9GxFAuWxgR90fEs/nnglweEXFr7utjEXFx0/2szfWfjYi1TeV/kO9/ON920r/IMiK+GRE7IuKJprKO97HVObrY589GxEi+1o9GxJVNx27K7X8mIt7fVD7m4zsvAb4xl9+ZlwMnIqbn/eF8fMUkdZmIWBYRD0XEUxHxZER8IpcXe61P0Ofeu9YppVr+o7EE8i+B84BpwM+AC7rdrnH2YQtw1nFlXwBuzNs3Ap/P21cC/wUEsBrYmMsXApvzzwV5e0E+9pNcN/Jtr+hCHy8DLgaemMw+tjpHF/v8WeDvxqh7QX7sTgdW5sf0lBM9voHvAtfk7a8Cf5m3/wr4at6+BrhzEvu8BLg4b58J/CL3rdhrfYI+99y1ntT/9BP8S74U+EHT/k3ATd1u1zj7sIXfDPpngCVND6Rn8vbXgI8cXw/4CPC1pvKv5bIlwM+byo+pN8n9XMGxodfxPrY6Rxf73Oo//zGPWxrf43Bpq8d3DrmXgb5cfrTekdvm7b5cL7p0ze8B3ns6XOsx+txz17rOQzdjfQn5QJfacrIScF9EbIrGl6UDLE4pbcvbLwGL83ar/p6ofOsY5b1gMvrY6hzd9LE8TPHNpuGF8fZ5EbAzpXTouPJj7isf35XrT6o8jHARsJHT5Fof12fosWtd56AvwbtSShcDVwDXR8RlzQdT4+m66Pmvk9HHHvk93ga8Ffh9YBvwxa62pkMiYg7wfeCTKaXd
zcdKvdZj9LnnrnWdg772X0KeUhrJP3cA/wFcAmyPiCUA+eeOXL1Vf09UvnSM8l4wGX1sdY6uSCltTym9mVI6DHydxrWG8ff5FWB+RPQdV37MfeXj83L9SRERU2kE3rdTSnfn4qKv9Vh97sVrXeegr/WXkEfE7Ig488g28D7gCRp9ODLTYC2NcT9y+bV5tsJqYFd+u/oD4H0RsSC/RXwfjXG8bcDuiFidZydc23Rf3TYZfWx1jq44EkTZn9K41tBo5zV5FsVKYBWNDx3HfHznV6wPAVfn2x//+zvS56uBB3P9jsu//9uBp1NKNzcdKvZat+pzT17rbnxoMYEfflxJ45PuXwKf7nZ7xtn282h8uv4z4Mkj7acxzvYA8CzwP8DCXB7AV3JfHwcGm+7rL4Dh/O+jTeWD+UH2S+DLdOGDOeA7NN6+HqQxxnjdZPSx1Tm62Od/y316LP8nXdJU/9O5/c/QNDOq1eM7P3Z+kn8X3wOm5/IZeX84Hz9vEvv8LhpDJo8Bj+Z/V5Z8rU/Q55671i6BIEmFq/PQjSSpDQa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKtz/A1/NmoIeUlAfAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)\n",
+ "plt.plot(user_click_item_count)"
]
- },
- "execution_count": 37,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+     "The number of articles a user has clicked indicates how active that user is."
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 38,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[:100])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "The 100 most-clicked news articles each have more than 1,000 clicks."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 39,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[:20])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
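-    "The 20 most-clicked news articles each have more than 2,500 clicks. Idea: these can be defined as hot news. This is a simple approach; later on, article popularity is also divided according to click count and time."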
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 40,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[3500:])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Many news articles were clicked only once or twice. Idea: these can be defined as cold (unpopular) news."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### News co-occurrence frequency: how many times two articles are clicked consecutively"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 433597.000000 \n",
- " \n",
- " \n",
- " mean \n",
- " 3.184139 \n",
- " \n",
- " \n",
- " std \n",
- " 18.851753 \n",
- " \n",
- " \n",
- " min \n",
- " 1.000000 \n",
- " \n",
- " \n",
- " 25% \n",
- " 1.000000 \n",
- " \n",
- " \n",
- " 50% \n",
- " 1.000000 \n",
- " \n",
- " \n",
- " 75% \n",
- " 2.000000 \n",
- " \n",
- " \n",
- " max \n",
- " 2202.000000 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " count\n",
- "count 433597.000000\n",
- "mean 3.184139\n",
- "std 18.851753\n",
- "min 1.000000\n",
- "25% 1.000000\n",
- "50% 1.000000\n",
- "75% 2.000000\n",
- "max 2202.000000"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tmp = user_click_merge.sort_values('click_timestamp')\n",
- "tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x:x.shift(-1))\n",
- "union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].agg({'count'}).reset_index().sort_values('count', ascending=False)\n",
- "union_item[['count']].describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
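-    "The statistics show a mean co-occurrence count of 3.18 and a maximum of 2202.\n",
-    "\n",
-    "This suggests that the news articles a user reads are fairly strongly related."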
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 42,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Plot it for a more intuitive view\n",
- "x = union_item['click_article_id']\n",
- "y = union_item['count']\n",
- "plt.scatter(x, y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD4CAYAAADvsV2wAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAATdElEQVR4nO3df6xkZX3H8fe37Aq2EPmxN7pd9nKhmhgxuOB1hUANISHlV+CPYrqkRUTNNoopVlsrmiCamIhNlSpG3ApF1Cr4syuFWFqwahuW7OKy/BK9KgYQ3AVkkarU1W//mLMwd5hzZ+7MmTt3znm/ksmeOeeZOd89s/dzn32ec85EZiJJqr/fG3cBkqSlYeBLUkMY+JLUEAa+JDWEgS9JDbFiXDtetWpVzszMjGv3kjSRtm3b9mhmTg3y2rEF/szMDFu3bh3X7iVpIkXETwZ9rUM6ktQQBr4kNYSBL0kNYeBLUkMY+JLUEH0HfkTsExHfjYjru2zbNyKujYi5iNgSETOVVilJGtpievgXAveWbHsj8PPMfDHwEeDSYQuTJFWrr/PwI+JQ4HTgA8DbuzQ5C7ikWP4ScHlERI7g3sv3PfIL/m3HT0u3n/CSKdYffnDVu5WkidfvhVeXAe8EDijZvgZ4ACAz90TEbuAQ4NH2RhGxEdgIMD09PUC5MLfzKT52y1zXbZlw648f57q/PG6g95akOusZ+BFxBrAzM7dFxInD7CwzNwGbAGZnZwfq/Z9+1GpOP+r0rtv+/FO38vRvfjd4gZJUY/2M4R8PnBkR9wNfAE6KiM92tHkIWAsQESuAFwCPVVhn3/z+LknqrmfgZ+ZFmXloZs4AG4CbM/MvOpptBs4rls8u2pi9krSMDHzztIh4P7A1MzcDVwKfiYg54HFavxiWXBD4e0aSultU4GfmN4FvFssXt63/NfDaKguTJFWrVlfaRoy7AklavmoV+OCkrSSVqV3gS5K6q13gO2crSd3VLvAlSd3VKvAjwjF8SSpRq8CXJJWrVeB7VqYklatV4APO2kpSiVoFvhdeSVK5WgU+eOGVJJWpXeBLkrqrVeAHDuFLUplaBb4kqVytAj+ctZWkUrUKfIB02laSuqpV4Nu/l6RytQp8cNJWksrULvAlSd3VKvAj7OFLUplaBb4kqVzNAt9pW0kqU7PA9146klSmVoHvdVeSVK5n4EfEfhFxW0TcERF3R8T7urR5fUTsiojtxeNNoym3t3TWVpK6WtFHm6eBkzLzqYhYCXwnIm7MzFs72l2bmW+tvkRJUhV6Bn62usxPFU9XFo9l2Y12REeSyvU1hh8R+0TEdmAncFNmbunS7E8jYkdEfCki1pa8z8aI2BoRW3ft2jV41ZKkResr8DPzt5m5DjgUWB8RL+9o8nVgJjOPAm4CPl3yPpsyczYzZ6empoYouzsnbSWp3KLO0snMJ4BbgFM61j+WmU8XTz8FvLKS6gbgnK0kddfPWTpTEXFgsfx84GTgex1tVrc9PRO4t8Ia+xaO4ktSqX7O0lkNfDoi9qH1C+K6zLw+It4PbM3MzcBfRcSZwB7gceD1oyq4F++HL0nd9XOWzg7g6C7rL25bvgi4qNrSJElVqt2Vto7hS1J3tQp8SVK5WgW+p2VKUrlaBT4s00uAJWkZqFXge1qmJJWrVeCDd8uUpDK1C3xJUnf1CvxwDF+SytQr8CVJpWoV+E7ZSlK5WgU+4JiOJJWoVeCHV15JUqlaBT7YwZekMrULfElSd7UK/MALrySpTK0CX5JUrlaB75ytJJWrVeCDk7aSVKZWgW8HX5LK1Srwwa84lKQytQt8SVJ3tQr8iCAdxZekrmoV+JKkcrUKfCdtJalcz8CPiP0i4raIuCMi7o6I93Vps29EXBsRcxGxJSJmRlJtH5y0laTu+unhPw2clJmvANYBp0TEsR1t3gj8PDNfDHwEuLTSKvtlF1+SSq3o1SBbN6d5qni6snh09qPPAi4plr8E
XB4RkWO4sc3j//t//O0X7xj49RvWr+WVhx1cYUWStDz0DHyAiNgH2Aa8GPh4Zm7paLIGeAAgM/dExG7gEODRjvfZCGwEmJ6eHq7yLtbPHMytP3yM/557tHfjLh558tckGPiSaqmvwM/M3wLrIuJA4KsR8fLMvGuxO8vMTcAmgNnZ2cp7/xvWT7Nh/eC/SI7/4M0VViNJy8uiztLJzCeAW4BTOjY9BKwFiIgVwAuAxyqob8k56Suprvo5S2eq6NkTEc8HTga+19FsM3BesXw2cPM4xu8lSeX6GdJZDXy6GMf/PeC6zLw+It4PbM3MzcCVwGciYg54HNgwsopHzCt1JdVVP2fp7ACO7rL+4rblXwOvrbY0SVKVanWl7bD8AhVJdWbgd3JER1JNGfht7OFLqjMDv4MdfEl1ZeBLUkMY+G2CwMsHJNWVgS9JDWHgt3HSVlKdGfgdHNCRVFcGfhs7+JLqzMDv4JytpLoy8CWpIQz8NhHhGL6k2jLwJakhDPw2TtpKqjMDv4NX2kqqKwO/nV18STVm4Hewfy+prgx8SWoIA79NgF18SbVl4EtSQxj4bcLbZUqqMQO/QzqmI6mmDPw29u8l1VnPwI+ItRFxS0TcExF3R8SFXdqcGBG7I2J78bh4NOWOntddSaqrFX202QO8IzNvj4gDgG0RcVNm3tPR7tuZeUb1JUqSqtCzh5+ZD2fm7cXyL4B7gTWjLmwcIuzhS6qvRY3hR8QMcDSwpcvm4yLijoi4MSKOLHn9xojYGhFbd+3atfhqJUkD6zvwI2J/4MvA2zLzyY7NtwOHZeYrgI8BX+v2Hpm5KTNnM3N2ampqwJJHJ5y2lVRjfQV+RKykFfafy8yvdG7PzCcz86li+QZgZUSsqrTSJeJpmZLqqp+zdAK4Erg3Mz9c0uZFRTsiYn3xvo9VWehS8LorSXXWz1k6xwPnAndGxPZi3buBaYDMvAI4G3hzROwBfgVsyAm9sfxkVi1JvfUM/Mz8Dj2uScrMy4HLqypKklQ9r7TtYAdfUl0Z+JLUEAZ+G++WKanODPwOTtpKqisDv439e0l1ZuA/h118SfVk4EtSQxj4bbxbpqQ6M/AlqSEM/DYRjuBLqi8DX5IawsBv4/3wJdWZgd9hQm/yKUk9GfiS1BAGfhsnbSXVmYEvSQ1h4LcJvPBKUn0Z+JLUEAZ+O++HL6nGDPwOjuhIqisDX5IawsBv05q0tY8vqZ4MfElqCAO/jXO2kuqsZ+BHxNqIuCUi7omIuyPiwi5tIiI+GhFzEbEjIo4ZTbmSpEGt6KPNHuAdmXl7RBwAbIuImzLznrY2pwIvKR6vBj5R/DlR7OBLqrOegZ+ZDwMPF8u/iIh7gTVAe+CfBVyTrRnPWyPiwIhYXbx2otz50G7OvXLLuMt4jvOPn+Gkl75w3GVImmD99PCfEREzwNFAZyKuAR5oe/5gsW5e4EfERmAjwPT09CJLHb0zjvpDvr7jpzz19J5xlzLP3Q89ydQB+xr4kobSd+BHxP7Al4G3ZeaTg+wsMzcBmwBmZ2eX3fmPbzjhcN5wwuHjLuM5Trj05nGXIKkG+jpLJyJW0gr7z2XmV7o0eQhY2/b80GKdqrLsfj1KmjT9nKUTwJXAvZn54ZJmm4HXFWfrHAvsnsTxe0mqs36GdI4HzgXujIjtxbp3A9MAmXkFcANwGjAH/BI4v/JKG8wvZpFUhX7O0vkOPc5YLM7OuaCqoiRJ1fNK2wkQXiEgqQIG/oTwpm6ShmXgTwDv8SOpCgb+hLB/L2lYBr4kNYSBPwFaX8wy7iokTToDX5IawsCfABHhGL6koRn4ktQQBv4E8KxMSVUw8CeEF15JGpaBL0kNYeBPAu+WKakCBr4kNYSBPwEC7OJLGpqBL0kNYeBPgPB2mZIqYOBPiHRMR9KQDHxJaggDfwJ4t0xJVTDwJakhDPwJEGEPX9LwDHxJaggDfwKE98uUVIGegR8RV0XEzoi4q2T7iRGxOyK2F4+Lqy9T
npYpaVgr+mhzNXA5cM0Cbb6dmWdUUpEkaSR69vAz81vA40tQi0o4aSupClWN4R8XEXdExI0RcWRZo4jYGBFbI2Lrrl27Ktq1JKkfVQT+7cBhmfkK4GPA18oaZuamzJzNzNmpqakKdt0cdvAlDWvowM/MJzPzqWL5BmBlRKwaujJJUqWGDvyIeFEUt3OMiPXFez427PvqWd4tU1IVep6lExGfB04EVkXEg8B7gZUAmXkFcDbw5ojYA/wK2JB+43blPKKShtUz8DPznB7bL6d12qYkaRnzStsJ0BrQsYsvaTgGviQ1hIE/AbzwSlIVDHxJaggDfwJ4VqakKhj4E8IRHUnDMvAngPfDl1QFA39CeC2bpGEZ+JLUEAb+BIhwDF/S8Ax8SWoIA38COGUrqQoG/oRwzlbSsAz8SeCVV5IqYOBPCDv4koZl4EtSQxj4EyDwwitJwzPwJakhDPwJ4JytpCoY+JLUEAb+BLCDL6kKBv6EcM5W0rAMfElqCAN/AkQE6aVXkobUM/Aj4qqI2BkRd5Vsj4j4aETMRcSOiDim+jIlScPqp4d/NXDKAttPBV5SPDYCnxi+LLVz0lZSFVb0apCZ34qImQWanAVck61LQW+NiAMjYnVmPlxVkYLbf/IEJ3/4v8ZdhqQK/Nmr1vKmPz5iyffbM/D7sAZ4oO35g8W65wR+RGyk9b8ApqenK9h1M5x73GF84+5Hxl2GpIqs2n/fsey3isDvW2ZuAjYBzM7OOgvZp7PWreGsdWvGXYakCVfFWToPAWvbnh9arJMkLSNVBP5m4HXF2TrHArsdv5ek5afnkE5EfB44EVgVEQ8C7wVWAmTmFcANwGnAHPBL4PxRFStJGlw/Z+mc02N7AhdUVpEkaSS80laSGsLAl6SGMPAlqSEMfElqiBjXl2NHxC7gJwO+fBXwaIXlVMnaFm+51gXWNojlWhfUo7bDMnNqkB2MLfCHERFbM3N23HV0Y22Lt1zrAmsbxHKtC6zNIR1JaggDX5IaYlIDf9O4C1iAtS3ecq0LrG0Qy7UuaHhtEzmGL0lavEnt4UuSFsnAl6SmyMyJetD6ft37aN2d810j3M/9wJ3AdmBrse5g4CbgB8WfBxXrA/hoUdMO4Ji29zmvaP8D4Ly29a8s3n+ueG0sUMtVwE7grrZ1I6+lbB991HYJre9E2F48TmvbdlGxn/uAP+n1uQKHA1uK9dcCzyvW71s8nyu2z3TUtRa4BbgHuBu4cLkctwVqG+txA/YDbgPuKOp63xDvVUm9fdR2NfDjtmO2bkw/B/sA3wWuXy7HrGuWjCowR/EoDuoPgSOA5xUf/stGtK/7gVUd6z6094AD7wIuLZZPA24s/pEdC2xp+4fyo+LPg4rlvQFzW9E2iteeukAtrwGOYX6ojryWsn30UdslwN90afuy4jPbt/jH+sPiMy39XIHrgA3F8hXAm4vltwBXFMsbgGs79rWa4occOAD4frH/sR+3BWob63Er/h77F8sraYXJsYt9ryrr7aO2q4Gzuxyzpf45eDvwLzwb+GM/Zl2zZBRhOaoHcBzwjbbnFwEXjWhf9/PcwL8PWN32Q3tfsfxJ4JzOdsA5wCfb1n+yWLca+F7b+nntSuqZYX6ojryWsn30UdsldA+ueZ8X8I3iM+36uRY/eI8CKzo//72vLZZXFO0W+l/SvwInL6fj1qW2ZXPcgN8Hbgdevdj3qrLekuPVXtvVdA/8Jfs8aX3L338CJwHXD3L8R33M9j4mbQy/7AvTRyGBf4+IbcWXrwO8MJ/9Nq9HgBf2qGuh9Q92Wb8YS1FL2T768daI2BERV0XEQQPWdgjwRGbu6VLbM68ptu8u2j9HRMwAR9PqFS6r49ZRG4z5uEXEPhGxndYw3U20epeLfa8q620/VvNqy8y9x+wDxTH7SETs/Xbwpfw8LwPeCfyueD7I8R/JMes0aYG/lE7IzGOAU4ELIuI17Ruz9Ws1x1JZh6WoZZH7+ATwR8A64GHgH0ZU
Vk8RsT/wZeBtmflk+7ZxH7cutY39uGXmbzNzHa1e63rgpUtdQ5nO2iLi5bR6uy8FXkVrmObvRlzDvM8zIs4AdmbmtlHutyqTFvhL9oXpmflQ8edO4Ku0/vH/LCJWAxR/7uxR10LrD+2yfjGWopayfSwoM39W/HD+DvgnWsdukNoeAw6MiBUd6+e9V7H9BUX7Z0TESlqB+rnM/EqPv9OSHrdutS2X41bU8gStieXjBnivKut9jrbaTsnMh7PlaeCfGfyYDfp5Hg+cGRH3A1+gNazzjwv8fcZyzJ7Ra8xnOT1ojXn9iNakxt4JjCNHsJ8/AA5oW/4fWjPlf8/8yZsPFcunM3+C6LZi/cG0ziA4qHj8GDi42NY5QXRaj5pmmD9OPvJayvbRR22r25b/GvhCsXwk8yemfkRrUqr0cwW+yPyJqbcUyxcwf/Lruo6aArgGuKxj/diP2wK1jfW4AVPAgcXy84FvA2cs9r2qrLeP2la3HdPLgA+O8efgRJ6dtB37MetaY9VhOeoHrdn379MaW3zPiPZxRHFg954C9p5i/SG0Jmd+APxH2z+UAD5e1HQnMNv2Xm+gddrUHHB+2/pZ4K7iNZez8ITj52n9F/83tMbq3rgUtZTto4/aPlPsewewmflB9p5iP/fRdmZS2edafBa3FTV/Edi3WL9f8Xyu2H5ER10n0Pqv9w7aTnNcDsdtgdrGetyAo2idWrij+HtdPMR7VVJvH7XdXByzu4DP8uyZPEv6c1C0O5FnA3/sx6zbw1srSFJDTNoYviRpQAa+JDWEgS9JDWHgS1JDGPiS1BAGviQ1hIEvSQ3x/4tppPoWqYdUAAAAAElFTkSuQmCC\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(union_item['count'].values[40000:])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Roughly 75,000 pairs co-occur at least once."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### News article information"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Number of occurrences of each news category\n",
- "plt.plot(user_click_merge['category_id'].value_counts().values)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# News categories that appear rarely; some categories occur only a handful of times\n",
- "plt.plot(user_click_merge['category_id'].value_counts().values[150:])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "count 1.630633e+06\n",
- "mean 2.043012e+02\n",
- "std 6.382198e+01\n",
- "min 0.000000e+00\n",
- "25% 1.720000e+02\n",
- "50% 1.970000e+02\n",
- "75% 2.290000e+02\n",
- "max 6.690000e+03\n",
- "Name: words_count, dtype: float64"
- ]
- },
- "execution_count": 46,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
-    "# Descriptive statistics of the news word count\n",
- "user_click_merge['words_count'].describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAEJCAYAAAB4yveGAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAsUUlEQVR4nO3deZxcVZn/8c+TFUgCCaQJSAgJEkBAEcgAioMLyuYMYRQVNyJmJjMj+tNxG1BnUBBEGVFQCEYWg8gSEAUhJIQQlpi1Q8hCFtJZOnu60+l0p5N0p5fn90fd7lR3qrqWruV23+/79epXV506deu5Vbeee+ucc881d0dERKKhV7EDEBGRwlHSFxGJECV9EZEIUdIXEYkQJX0RkQhR0hcRiZCUSd/MTjOzt+L+as3sW2Z2tJnNMLM1wf8hQX0zs3vMrMzMlprZuXHLGhfUX2Nm4/K5YiIicijLZJy+mfUGtgAXADcAu9z9DjO7ERji7v9tZlcC3wCuDOrd7e4XmNnRQCkwBnBgEXCeu1fndI1ERCSpTJt3LgHWuns5MBaYHJRPBq4Obo8FHvGYecBgMzseuAyY4e67gkQ/A7i8qysgIiLp65Nh/WuBx4Pbw9x9W3B7OzAsuH0CsCnuOZuDsmTlSQ0dOtRHjhyZYYgiItG2aNGine5ekuixtJO+mfUDrgJu6viYu7uZ5WQ+BzObAEwAGDFiBKWlpblYrIhIZJhZebLHMmneuQJ40913BPd3BM02BP8rgvItwIlxzxselCUrb8fdJ7n7GHcfU1KScEclIiJZyiTpf56DTTsAzwGtI3DGAc/GlV8XjOK5EKgJmoGmA5ea2ZBgpM+lQZmIiBRIWs07ZjYA+ATw73HFdwBTzGw8UA58NiifSmzkThmwD7gewN13mdmtwMKg3i3uvqvLayAiImnLaMhmoY0ZM8bVpi8ikhkzW+TuYxI9pjNyRUQiRElfRCRClPRFRCJESV8i6Y01lZRX7S12GCIFl+kZuSI9wpcfXADAhjs+WeRIRApLR/oiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIWklfTMbbGZPm9kqM1tpZh8ws6PNbIaZrQn+DwnqmpndY2ZlZrbUzM6NW864oP4aMxuXr5USEZHE0j3SvxuY5u6nA2cDK4EbgZnuPhqYGdwHuAIYHfxNACYCmNnRwM3ABcD5wM2tOwoRESmMlEnfzI4CLgYeBHD3A+6+GxgLTA6qTQauDm6PBR7xmHnAYDM7HrgMmOHuu9y9GpgBXJ7DdRERkRTSOdIfBVQCD5vZYjN7wMwGAMPcfVtQZzswLLh9ArAp7vmbg7Jk5SIiUiDpJP0+wLnARHc/B9jLwaYcANzdAc9FQGY2wcxKzay0srIyF4sUEZFAOkl/M7DZ3ecH958mthPYETTbEPyvCB7fApwY9/zhQVmy8nbcfZK7j3H3MSUlJZmsi4iIpJAy6bv7dmCTmZ0WFF0CrACeA1pH4IwDng1uPwdcF4ziuRCoCZqBpgOXmtmQoAP30qBMREQKpE+a9b4B/MnM+gHrgOuJ7TCmmNl4oBz4bFB3KnAlUAbsC+ri7rvM7FZgYVDvFnfflZO1EBGRtKSV9N39LWBMgocuSVDXgRuSLOch4KEM4hMRkRzSGbkiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFpJX0z22Bmy8zsLTMrDcqONrMZZrYm+D8kKDczu8fMysxsqZmdG7eccUH9NWY2Lj+rJCIiyWRypP9Rd3+/u48J7t8IzHT30cDM4D7AFcDo4G8CMBFi
OwngZuAC4Hzg5tYdhYiIFEZXmnfGApOD25OBq+PKH/GYecBgMzseuAyY4e673L0amAFc3oXXFxGRDKWb9B14ycwWmdmEoGyYu28Lbm8HhgW3TwA2xT13c1CWrFxERAqkT5r1PuTuW8zsWGCGma2Kf9Dd3cw8FwEFO5UJACNGjMjFIkVEJJDWkb67bwn+VwB/IdYmvyNotiH4XxFU3wKcGPf04UFZsvKOrzXJ3ce4+5iSkpLM1kZERDqVMumb2QAzG9R6G7gUWA48B7SOwBkHPBvcfg64LhjFcyFQEzQDTQcuNbMhQQfupUGZiIgUSDrNO8OAv5hZa/3H3H2amS0EppjZeKAc+GxQfypwJVAG7AOuB3D3XWZ2K7AwqHeLu+/K2ZqIiEhKKZO+u68Dzk5QXgVckqDcgRuSLOsh4KHMwxQRkVzQGbkiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEZJ20jez3ma22MyeD+6PMrP5ZlZmZk+aWb+gvH9wvyx4fGTcMm4Kyleb2WU5XxsREelUJkf63wRWxt3/OfArdz8FqAbGB+Xjgeqg/FdBPczsDOBa4EzgcuA+M+vdtfBFRCQTaSV9MxsOfBJ4ILhvwMeAp4Mqk4Grg9tjg/sEj18S1B8LPOHuDe6+HigDzs/BOoiISJrSPdL/NfB9oCW4fwyw292bgvubgROC2ycAmwCCx2uC+m3lCZ4jIiIFkDLpm9k/ARXuvqgA8WBmE8ys1MxKKysrC/GSIiKRkc6R/kXAVWa2AXiCWLPO3cBgM+sT1BkObAlubwFOBAgePwqoii9P8Jw27j7J3ce4+5iSkpKMV0hERJJLmfTd/SZ3H+7uI4l1xL7i7l8EZgHXBNXGAc8Gt58L7hM8/oq7e1B+bTC6ZxQwGliQszUREZGU+qSuktR/A0+Y2U+BxcCDQfmDwB/NrAzYRWxHgbu/bWZTgBVAE3CDuzd34fVFRCRDGSV9d38VeDW4vY4Eo2/cvR74TJLn3wbclmmQ3dHufQfYWXeAU44dWOxQsrazroG6+iZGDh1Q7FBEJEd0Rm6eXHn3G3z8rteKHUaXnH/by3zk/14tdhgikkNK+nmytaa+2CF0WYsXOwIRyTUlfRGRCFHSFxGJECV9EZEIUdIXEYkQJX0RkQhR0hcRiRAlfRGRCFHSFxGJECV9EZEIUdIXEYkQJX0RkQhR0hcRiRAlfRGRCFHSFxGJECV9EZEIUdIXEYkQJX0RkQhR0hcRiZCUSd/MDjOzBWa2xMzeNrOfBOWjzGy+mZWZ2ZNm1i8o7x/cLwseHxm3rJuC8tVmdlne1kpERBJK50i/AfiYu58NvB+43MwuBH4O/MrdTwGqgfFB/fFAdVD+q6AeZnYGcC1wJnA5cJ+Z9c7huuTE1x97k19MW1XsMLps0utr+fKD84sdRrfytT8t4pcvrS52GG121NZz7q0zKKvYU+xQ0vK3JVv5+F2v0aKLK4dayqTvMXXB3b7BnwMfA54OyicDVwe3xwb3CR6/xMwsKH/C3RvcfT1QBpyfi5XIpeeXbuO+V9cWO4wuu33qKt5Ys7PYYXQrU5dt5zevlBU7jDbTlm9n194DPDK3vNihpOU7U5ZQVlFHY0tLsUORTqTVpm9mvc3sLaACmAGsBXa7e1NQZTNwQnD7BGATQPB4DXBMfHmC54iISAGklfTdvdnd3w8MJ3Z0fnq+AjKzCWZWamallZWV+XoZEZFIymj0jrvvBmYBHwAGm1mf4KHhwJbg9hbgRIDg8aOAqvjyBM+Jf41J7j7G3ceUlJRkEp5Ij+RqIpccSmf0TomZDQ5uHw58AlhJLPlfE1QbBzwb3H4uuE/w+Cvu7kH5tcHonlHAaGBBjtZDpMcxK3YE0hVVdQ3M
WLGj2GEcok/qKhwPTA5G2vQCprj782a2AnjCzH4KLAYeDOo/CPzRzMqAXcRG7ODub5vZFGAF0ATc4O7NuV0dESkWRz9J4n3l4YUs21LD8p9cxsD+6aTawkgZibsvBc5JUL6OBKNv3L0e+EySZd0G3JZ5mCLR1d2SqaGfKAAbqvYC0ByyIaw6I1ckpJQ6JR+U9EVEIkRJP0SaW5wr7n6D6W9vL3YoBTmr8vEFG/nC7+fl/XWk6z73u7lMWbgpdUUJPSX9ENl7oImV22r57pQlxQ6Fhqb8n1V50zPLmLO2Ku+v092FYcjm/PW7+P6flxY7DMkBJX2RsNKYze4tBDvrRJT0RSQnwvCLJExa346w7buV9EVCrrvl0rAluWIL29uhpC8SUmFLFtIzKOmLiESIkr6ISB54SDs5lPRFQi6kuUPSZCHr5FDSl4RCtp1GUnf7DLRv6h6U9EUkp7rZvipvwroTVNIPEf2Ml8S0YXRnYdsJKumHUdi2EikKTVEs+aCkLyI5pd8lMWH95a6kLyI5od8liYWtQ15JXyTkwnrE2FE3CTPylPQlobAdnUSRPgPJByV9EZE8COu1jZX0RUTyKGyjsFImfTM70cxmmdkKM3vbzL4ZlB9tZjPMbE3wf0hQbmZ2j5mVmdlSMzs3blnjgvprzGxc/lZLpOfoLm360l5YP7d0jvSbgO+4+xnAhcANZnYGcCMw091HAzOD+wBXAKODvwnARIjtJICbgQuA84GbW3cUInKocB0fSrbC1jeTMum7+zZ3fzO4vQdYCZwAjAUmB9UmA1cHt8cCj3jMPGCwmR0PXAbMcPdd7l4NzAAuz+XKiEjxhHVWSWkvozZ9MxsJnAPMB4a5+7bgoe3AsOD2CcCmuKdtDsqSlUsrfWckgbB2CEr3lHbSN7OBwJ+Bb7l7bfxjHtvF52TLNLMJZlZqZqWVlZW5WGS3E7Jfg1IkYWsWkJ4hraRvZn2JJfw/ufszQfGOoNmG4H9FUL4FODHu6cODsmTl7bj7JHcf4+5jSkpKMlkXyaGwjTgQ6W7C+vssndE7BjwIrHT3u+Ieeg5oHYEzDng2rvy6YBTPhUBN0Aw0HbjUzIYEHbiXBmUiIj1PkPXD9outTxp1LgK+DCwzs7eCsh8AdwBTzGw8UA58NnhsKnAlUAbsA64HcPddZnYrsDCod4u778rFSoj0ZOoflVxKmfTdfTbJm5kvSVDfgRuSLOsh4KFMAhSJqu7WxKZ9U/egM3IlobD9JBWR3FDSl27t0xPn8KO/Lit2GHnV1SPoq347m5E3vkBFbT0X3j6Tia+uzUlc0j0p6Uu3tqi8mkfnbSx2GPmRo19bSzfXADBv/S6219bz82mrcrNg6VRYz69Q0u+CnXUNrKusK3YYeaHWnZ5HZ8x2zY7aejbt2pd2/da3O2x9M+mM3pEkPvTzV6hvbGHDHZ8sdigiBdXS4jS2tNC/T+9ih1IwF9w+EyDj73vY+sd0pJ+Gij317DvQdEh5fWNLTl8nrD8Hk2loaubjd73GnLKdxQ4Fd2djVfpHYZ2pqK2nvrGZO6ev4qZnit9fkOgAvbnFWZvDX5lVdQ3sqW9MWa/1Pf7uU0s47UfT2j2mHxLdQ49P+i0tzrenvMWSTbuzXsb5t83k6nv/nrugUrCwHRokUV61j7KKOm5+7u1ih8LjCzZx8Z2zWFTe9VM/zr99Jtc9uIB7Z63l8QXF6y/obCv49cvvcMkvX6OsIv3E39JJVj7vpy9z8S9mpVzGxXfO4uUVO3hm8SEn00s30eOTfmVdA8+8uYV/e6S0S8t5Z0d2R1XVew906XUlpr6xOeGvrVaLN1YDsLZiL7PX7OTvXfz1sWBDuM8bLN0QW9+K2vqcLbN6X+ojfYCV22pTV5LQ6vFJv5C++oeFXHH3G+3Kzrl1Bm8GCakQKmrrmb+uql3ZvgNNjLzxBaYs3JTkWV2zpqIurQ7tnXUNzFmbXTL+x1/M4oz/nc6BpsRNaq0/jhznSw/O
54sPzG/3eFNzC9OWb8tLZ2Zzi9PSkvvltv7i66zZL5NXTbbqSzfvzmApqV+zuzbzvP5OJTVp7vjSEda3QUk/h15ZVZHwKGjF1sIdGf3Tb2bzuUnz2pXtqG0A4L5Xy3L6WvFf7o/98rWU9T/3u7l84ffzU9ZLpHJPbB1+OWN1wsdbR0gky70TX13Lfzz6JtPf3pHV63fm3T+YylX3zs75ctuadxKsUzYtgMnem6t+m5umy27SKpnQ7n0HuO6hBfz7o11rEUgkbG+Lkn4BFHKPXxEkxzBaW7m3y8vYUr0/YXlrwknW8bq1Jva8XXlqblu+pThNHsU4qg7Lkfyq7bU8vWhzTpZ1oDn2C7KsouvbaKuwDpFV0g+8vGIH33xicdv92vpGPnv/3IzG5SYV0g9/Y9U+FpUnbnpKpzO5GEd2yd7JQsayrWY/10ycU7D+mkSdpvHNWenKVRJK9pqdLX5nXQPXTJzTrg+iucUZ/4eFLFh/sP/kqdJN/DjNgQGX//oNvvvUkvSCTlv+munCQkk/8K+PlPLsW1vb7k9btp0FG3Zxz8w1XV52OFN+bCTGpyfOKXYYOdL5FyuX+91Jr6+jtLy6qCNYsjnhJ9u3oOPOIpv38vH5Gyktr+aRueVtZVV1DcxcVcENj73ZVva9p5fyhzkbsow0e63vZ0iPz3IqMkm/Yk8DD81e37YB/3nRZjZX52ZcdyqF2JC27N7PU6Xpd9Tu3tf5UWp8Snls/sa2NnWINZG8s2NPcb4gSV4z3YOpbA669h9obh9C25mW6dvb0NRuHHz13gM0NDUfUq+2vrFtlFI6sWbyGTyRo+GnhfzYl2zazazVFakrdlHIDsbzKjJJH+CW51cwu2wnTc0tfOepJVwzcW7Sut3tRKnP/W4u33t6aad14tfoU/elf4T/g78s42t/WtR2/4q7X+fSX72eaYh5leo725Ud1F3JOo8zSBTvv+Ul3vvjl9run3PrDL784IJD6r3vxy9x0R2vpFxeNknqzY27M39SkY299+9c//DC1BVzJJff+rBmkB6f9Dt+2ePPoq3Yk3qMcy6OAArRoVOZYQfuup2ZdVhVxbVft44GKoZkO+NUn1Pr87L5OPfUtz8/oPXzzGRZjc2Hxh3flh0v3fHyUJjEcsjm2wPbQPJ5oB+2HxGRm3vH3dPqWMnldp3uovL9XUq21iu31VJWUce2mv0YxiffdzzHHXlYyuX97rXCT9Gb7D1Kt407JzvxtmXl9+scpSaHQnl1dQX1jc1U7Gngug+MPOTxsI64yaXoJX0OJr90Pt5czJCX6XZU6C97xxPKnizdxEvfurh9pQTrEKZT8Qv5nrW16YcgKRcjSXXntPiVuKaii0eXMHLoACA/O/Cw7j96fPNOR+5xw906+VDy0ba3dPNuxv52NvWNzcxZu5PP3D+Hpub0Jm2bv66K55duTfp4LuNNNBQxLNtv8iP97J6XjicWbuJ7cUMDW7Jo3sm1Qg4DzGXrTpj6ypoSnK2Wj+jCcHAQr0cn/cbmFr7/5847N1PJZZv+T/62giWba1i+pYZvP7mEhRuqqaxLr338c5Pm8fXHFqeu2FkcOa4XJqmSYFuTTJap+qm4k4Da3p8QfJuLM4Aq81cNwVt1iPiY2n79d8eNP0Mpk76ZPWRmFWa2PK7saDObYWZrgv9DgnIzs3vMrMzMlprZuXHPGRfUX2Nm4/KzOu29saaS19+p7FAajk81nV8bGS0vy8cSSdRkEJa2zmw7cg9WzEEMWQzZzEZnO6jOpmjIt5BsCjl18PvYA1eug3SO9P8AXN6h7EZgpruPBmYG9wGuAEYHfxOAiRDbSQA3AxcA5wM3t+4oCi3WvFPgjtwOy3Kgl7XOFZP/jSzTV+iOm31hr04UNO8U8ei1sH0YudsiwpRTe8W9iWG7ulU+pUz67v460HFs2VhgcnB7MnB1XPkjHjMPGGxmxwOXATPcfZe7VwMzOHRHkjONzS389PkV1O4/dCreH/xlGdOW
b097Wcm+XMu31LTdTjaVQas1FXtiy0rw2KLyav44r5ym5hZun7oyrZjmrq3ivFtn8IXfH5xYLZ3vUlXdAR6cvT7l6yT6Ym6o2sfehiY+cVfyidUa4/onlmzazYwViSc3a2nxdtdpbU4yE9ik19ceMoFd0jb9VEM2g+fdn4OLgnd2GbyNVfu4a8Y7BTtivG3qyrT7hdKxqHwXj84rZ/66Kh5fsJHqvQc45YcvtqtzX9x7OPLGF7h3Vhkjb3yhrSz+81yxtZYf/GUZ98yMTfY3a3Ulzy1J3jeVD+uTDE+OXV+5vF1ZbX0TD/99fcav8VTpJv66eAt3vLgqLzOu5lK2o3eGufu24PZ2YFhw+wQg/rTQzUFZsvK8+NuSrTwwe33CYYfV+xr5j0cXJXhWe6naLf/pNwdnVfz0xDmdXkJtSulmfnHN2QeX7dAr2N1+84m3ACgZ2K9du3FnPh8k+zlrq1LUbK+uoYlbn1+R0XPiTXx1LWs6uWjH1GXbGPv+2Mc6NrjoTKL3ZdHGaibGJY7ZZTv58Kklh9S7feoqetkq1v0s9eXpUnbkBp9npucnJFxWJ6N3vjp5IWUVdVxz7nBGHHNEl16nsx1Z60NlFXVMf3sHn3zf8Rkte9Lraxkz8mjOHdH+B/enO5yweOf0xCemdVZn1uoKrnxvLJ4r72k/Mmzltlr+3+OLuersd2UUb7amLtvG1/70Jg9cN4aPnzGs3WOt8/Z86cKT2m1AP/nbCq6/aFRGrxN/YuTFo4fywVOGZh90nnW5I9djhzQ527WZ2QQzKzWz0srKju3x6WntlW/MyRFQ7n72te84ar/cRCMJiiXZUWpziqPXpgQnICXS8Uiosyaujm9LGN6lzk70qm+MTa1QyOaXVJ9LIrdPXZXWWdmdXbgmmUI0Wabr7a2xX+SrthduFtRsPo9Cyjbp7wiabQj+t06OsQU4Ma7e8KAsWfkh3H2Su49x9zElJYce/aUjF9+3fH5usRPEcvt6Oe3ITVLeK0+JLBeL7ZUquDz00fRKkNkL9X0v1JDNbNYnjO3jqdYjl29nyHN+1kn/OaB1BM444Nm48uuCUTwXAjVBM9B04FIzGxJ04F4alOXFwSsOZe9Hf40NVsrlNVLjvwwdE0aYjo6STmoWgi9ztuP0cxpDJy/a+isp5U4oJL78YHYXtelMLhLob2auaddPkHUspJcLcvlpdXytsE2tnLJN38weBz4CDDWzzcRG4dwBTDGz8UA58Nmg+lTgSqAM2AdcD+Duu8zsVqD1dLhb3D1vFyFtfYvTuWDGN59YzODD+6a13K2797O9tp7+fQ7dV8Z3CKVzWcKOnUsdZ3JMpmPTy/x1VZx94uCkG/W9s8qyunBIou002zxWs7+RQf0PbmpvrGl/ycRcfCkaklxGMR86G7LZ0vYrIL1lrd+5l1HBWaEdvbVpd9vt196p5JRjB1Lf2Myg/n3YujvxxWQy1fGz6Cib9/X7Ty9l+ZYa6hqSNw3915Nvtc3nVLmngW9PeYu+vQ5+r34545129V9ZtYOjDu/H0QP68cqqCq7/4Mh2jy8qr+a04wYxsH/7lNbY0tL2+P0FmjYk3U78uoYm6hub6W3G4CP6FmznkDLpu/vnkzx0SYK6DtyQZDkPAQ9lFF2WMnnv4ufQT+WDncx+2PrLAEh4Qlj8l3TmqkOnir0x7opPnYXf8UpBn5s0j8+fPyJp/XQ64jpKtslms1HuqW/k7J+8xISLT24r++2srly2MXF0qeZgz+0Z1q1DNg99Pw6erZvee/XR/3s16SCAh/++oe32uIcOnZEzrOoamtqN8EnkLx2m8Hjmzc6n9PjqH9pfxjB+QEJtfSOfnjiHD59awuSvnt+u3u9eWwfEdpqvHXLOzkG5TLjpbmtn3XywseMnV53JuA47snzpkWfkhuzXFBCb777VtprUs3smsynB5QLX7NiT9fISSXakks37WhvMUPm3Ag/T6yinQyiTHOlX7Klnb0P6c+HHa2pu
yfoqbTm5uls31tp53tppW3RZbGqFuGZAqx6Z9EMrXx2hlt82yVaJOi5TaW3myFWfRRi6PtqmdOjwdpx/20z2Hshu9M4dL67iH38xK6t47py+mvKq3F3btfvK/luQ2+9P5htpNt+tbPXIpB+GxJA3CVbOzHLbfJFkYdm06R888zh5nRD+MOtU23z6RtuRfUeZdnr/PcNzLjrK9HoKPUoONv5ij94pZL+/kn4B5etzzfUGk3x+m/QmNWv/nOCxHH0m2S4ml5tES1vzjnHmzYkHoWX6mXT1Mwxjk2ahJPvllYlcjkzL7pSbwn2APXI+/TDm/Dunr87qiksdJUysBdpgsvlStR7pd9amXoiElY+L4nQWd1OLc/3DC9iyez8//uczUy6z48/7uRke+YdtWGAhFWoCvFz56+It7aZxAR3pd1kYZ8pbsH4XCzfE5ujpyvcz0ar1yvGnmGyYXqp2x+8+teSQsdWZTiz35MKNjLzxBXbUJu7szvVnW5HkdTqTTqf08i01zFpdyTs76vjCA6nHwnd8a7/ycGajdQrZJhxWuX4L7ns1NqdQsia8ZFJto9968i0emN1+fp9Cfnw9M+kXO4ACy/WRfi7b9Fufku5P3j8vig3dSzZJVvxiSjekf6pH/PPip4FI9jrJLN28u+12Lo+uu7qsB2evb5vvSGK6MvFZfWMzj86NnXtTvS+z81yyedXWc2mamlvyftDaM5N+CI/043WteefQdZtdtpMDOT45KdFbmNUp+WmM3onfaSWa1ybZl/ea++cmLE8lfsK9TPPCVb89mFg7nfqii236mT7/b0u2siTuZK4oif9OjLzxBX4VnNg1Ls1fS80tfsj36vT/mcaetuG3mX0Y2eSfhRuq2V5Tzyk/fJHHcjgLQCI9NOkXO4LccHeeKt3U7szG37+e+bSv2fjZi4dOv/yzF1clqNm5m4KTzlJ9JgeaWnh8wcaDnaRxX7Qzbp7Wdjvbzzb+i/hS3JTPXRlK2lkuyHSxap7putaDh7tnruGC219OebZxq3f/YCo/m3rotr0nOMck47mrstykNgTDbp9dnN9zWnpkR26IJqxMKN3wSsur+d7TS3lh2ba2sgM5nDu9M79/Izc7lxeDaxekOvr5zStr+M0rB8/Ujc+B9Y1dX+dkr55sLv905LJZreP7k4t1jopEm9aO2syGsP6xw7z68Qq1P259mXxfR7iHJv1wZ/1UUz+0HuXWBUcar67OborpYknU8bW3k7mFHltQztRl7S9s85kkTTevvVOZ9kRctz6/gv/48LsZdFgfdidpl11bWUfN/kbOfNeRaS0z3vqdya8t8MzizqcV6OjNjbszfn2JaT1C3p5Fp3w6/uvJt5i37mD/Uev2d98Xz6W+sfmQefrnr9/F5Wcdl3BZnU1V3TZRZJ7Tl4W5/XvMmDFeWlqaumIHf5xXzv/EzYXT3Rw9oB9v/s8nmLlyB+MnZ77+El6t8+zkYgZJifneZadlNcdUoZw2bBBjRg7hxitO5/tPL2379ZvM6ccNYtq3Lu7Sa5rZIncfk+ixHnmk390b9fcdaGLfgaaMr4wl4adkn3vbanIz42i+rN6xh9U79vCn+el10K7antu5tDrqmR25xQ6gi+obWzjjf6fz4OzCdNqKdGePzsvvaJdi+NR9+Rt+2yOTftgvTCwi0pk3N+5O2g/VVT0y6Svli0h39/5bZuRluT0z6Svri4gk1COT/hH9ehc7BBGRUOqRSf+kYxJfc1REJOp6ZNLvOG2piIjE9Mik39CU/OxPEZEoK3jSN7PLzWy1mZWZ2Y35eI1ehbwigYhIN1LQpG9mvYF7gSuAM4DPm9kZuX6dQYf1zfUiRUR6hEIf6Z8PlLn7Onc/ADwBjC1wDCIikVXopH8CsCnu/uagLKeymTFRuq+vf/SUYocgRVb6o4/ndfkXnXJMXpdfSKGbcM3MJgATAEaMGJHVMs4dMYQ3/+cTAPTtbQw6rC/fnvIWV551PPsbmxnQvzd7
G5rZUVvP2ScOZsmm3Vx+1nG8/s5OVm6rZe66Kq45bzj/fPa7OGZAP/YfaGbIgH4sKq+mvrGZYUf2Z099Ez99YSVPTLiQvr178cjcDazctodLTj+Wh+es51PnDOe04waxdfd+Tj/uSAYd1ofP/G4u/fv04oeffA+1+5uoa2ji8rOOo3Z/I4s37uaS9xzLvHVVnDx0IBffOSvhupUM6s9Prz6LX0xbxTXnncj2mv2cMORwzh91DEs376a8ah/nnTSEqroGqvfFpgx+Y81OBh/RlzUVddx+9XtZvrWGmv2NrN+5lwXrd/HlC0/io6cfy4U/m8mlZwzjhWXbGNCvDx87/ViufO/x9O1tjBo6gA1V+ygZ2J85a3dy0SlDWbalhmFHHsbwIYcDsP9AM1V7G3h3yUD21DdhBof17c26yr2cP+rotnXYU9/I715bx/S3t/PhU0sYVTKAC0YdQ/8+vTj2yP7sbWhmQ9Vehg85nJdXVPDx9xxLbX0TFbX1nPmuo9i5t4F+vXvR0NTMUYf3o2RQf/7t4pM5ol9vKvY00NTcwjED+7Nh515Wb99DyaD+HHtkfyr3NHDhycfQN3huU7Ozo7ae4486nDUVexh8eD8cZ/AR/dhZ18DKbbUMHdif2v2NNDY7v3r5Hc5615HMXFnRdlWlZD5z3nD+9R9PpqmlhfrGZo476nAG9u+De2z5VXUN9O/bm8P79uZAUwurttdyy/MrOOfEIexvbKa2vpFTjx3EuScNbltmRW0DTy7cxIINuzh/1NGs2lbLNeedyLAj+3P9RaN4a9NuNu3ax+hhA9td4Ssbf/rXC7jolKFt919esYMj+vVmUXk1HzntWO5/bS1ba/azOG5K6NOPG8SW6v3saWjiiH69OXpAP1panK999BR+9NflfOiUoazfuZd/PvtdnHbcQLburueD7z6G+sYWRg0dQMmg/uw70MTijbsp37WPy84YRv8+vRl4WB96B/1022r28+i8ck46ZgC1+xs50NxCv969GDqwP7/9wjl8/bHFDOrfh5e+fTG79h5g2+56ZpftpHcv41/OOYHRwwayYmstzy3ZyvgPjWLwEf0Y2L8Pjc0t7G9s5sjD+rJ8Sw0jjjmC8p37eHReORecfDSfOnc4ALX1jRzWpzf9+vRiT30je+qb2FFbz8klA6nd38jh/XrT1Oys3rGHl97e3jbJ2hcvGMFXPzSKgf370KeXsal6P/PXVXHqcYOYtmw7o4cN5KcvxC5cdMqxA7n2H07kCxdkl/9SKejUymb2AeDH7n5ZcP8mAHf/WaL62U6tLCISZZ1NrVzo5p2FwGgzG2Vm/YBrgecKHIOISGQVtHnH3ZvM7OvAdKA38JC7v13IGEREoqzgbfruPhWYWujXFRGRHnpGroiIJKakLyISIUr6IiIRoqQvIhIhSvoiIhFS0JOzMmVmlUB5FxYxFNiZo3ByRTGlL4xxKab0hTGuqMR0kruXJHog1Em/q8ysNNlZacWimNIXxrgUU/rCGJdiUvOOiEikKOmLiERIT0/6k4odQAKKKX1hjEsxpS+McUU+ph7dpi8iIu319CN9ERGJ0+2TfqoLrZtZfzN7Mnh8vpmNDElc3zazFWa21MxmmtlJxY4prt6nzczNLO8jCtKJycw+G7xXb5vZY/mOKZ24zGyEmc0ys8XBZ3hlnuN5yMwqzGx5ksfNzO4J4l1qZufmM54M4vpiEM8yM5tjZmcXO6a4ev9gZk1mdk0YYjKzj5jZW8F2/lregnH3bvtHbHrmtcDJQD9gCXBGhzpfA+4Pbl8LPBmSuD4KHBHc/s98x5VOTEG9QcDrwDxgTLFjAkYDi4Ehwf1jQ/L5TQL+M7h9BrAhzzFdDJwLLE/y+JXAi4ABFwLz8/0+pRnXB+M+uysKEVeqmOI+41eIzfh7TbFjAgYDK4ARwf28befd/Ug/nQutjwUmB7efBi4xMyt2XO4+y933BXfnAcOLHVPgVuDnQH2e40k3pn8D7nX3agB3
rwhJXA60Xoz5KGBrPgNy99eBXZ1UGQs84jHzgMFmdnw+Y0onLnef0/rZUZjtPJ33CuAbwJ+BQmxP6cT0BeAZd98Y1M9bXN096adzofW2Ou7eBNQA+b7KcaYXgB9P7Cgtn1LGFDQJnOjuL+Q5lrRjAk4FTjWzv5vZPDO7PCRx/Rj4kpltJna0+I0CxNWZTLe5YijEdp6SmZ0A/AswsdixxDkVGGJmr5rZIjO7Ll8vFLoLo0eNmX0JGAN8uMhx9ALuAr5SzDgS6EOsiecjxI4SXzez97r77mIGBXwe+IO7/zK49vMfzewsd28pclyhZGYfJZb0P1TsWIBfA//t7i35/9Gftj7AecAlwOHAXDOb5+7v5OOFurMtwIlx94cHZYnqbDazPsR+ileFIC7M7OPAD4EPu3tDkWMaBJwFvBp8EY4DnjOzq9w9X1enT+d92kysHbgRWG9m7xDbCSzMU0zpxjUeuBzA3eea2WHE5lApSHNBAmltc8VgZu8DHgCucPd8f/fSMQZ4ItjOhwJXmlmTu/+1iDFtBqrcfS+w18xeB84Gcp70897Rk+fOkT7AOmAUBzvczuxQ5wbad+ROCUlc5xDrLBwdlveqQ/1XyX9Hbjrv0+XA5OD2UGJNGMeEIK4Xga8Et99DrE3f8hzXSJJ3BH6S9h25CwqxXaUR1wigDPhgoeJJFVOHen+gAB25abxP7wFmBtveEcBy4Kx8xNGtj/Q9yYXWzewWoNTdnwMeJPbTu4xYR8q1IYnrTmAg8FRwxLHR3a8qckwFlWZM04FLzWwF0Ax8z/N8tJhmXN8Bfm9m/0WsU/crHnx788HMHifWxDU06Ee4GegbxHs/sX6FK4kl2H3A9fmKJcO4/pdYH9p9wXbe5HmeXCyNmAouVUzuvtLMpgFLgRbgAXfvdMhp1rHkcTsVEZGQ6e6jd0REJANK+iIiEaKkLyISIUr6IiIRoqQvIhIS6U4WF1c/48kINXpHRCQkzOxioI7YPEpnpag7GpgCfMzdq83sWE9jzh4d6YuIhIQnmJjNzN5tZtOCOXneMLPTg4eymoxQSV9EJNwmAd9w9/OA7wL3BeVZTUbYrc/IFRHpycxsILFrErSeuQ/QP/if1WSESvoiIuHVC9jt7u9P8FhWkxGqeUdEJKTcvZZYQv8MtF0Ws/WSk38ldpSPmQ0l1tyzLtUylfRFREIimJhtLnCamW02s/HAF4HxZrYEeJuDV3GbDlQFkxHOIs3JCDVkU0QkQnSkLyISIUr6IiIRoqQvIhIhSvoiIhGipC8iEiFK+iIiEaKkLyISIUr6IiIR8v8Bhwm8q0Q0/foAAAAASUVORK5CYII=\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(user_click_merge['words_count'].values)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### User preferences across clicked news categories\n",
- "\n",
- "This feature can be used to measure how broad a user's interests are."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAUlUlEQVR4nO3dfZBc1Xnn8e8zM3pBaCwkNBJCAiQbsKwEy8CYwoEihTG2wXGwY5dDditWHGrZsp3EjpNdw9q1dtXGu3YqNvFWsomJIaESyoGAMSQFwRhjezeJJY+MAAsEEuJFEnoZAXpBGAlJZ//oK2UkzfRtzfR097nz/VRNze3Tt/s+Z27rp9unT98bKSUkSfnrancBkqTmMNAlqSIMdEmqCANdkirCQJekiuhp5cZmz56dFi5c2MpNSlL2Vq5cuT2l1Fe2XksDfeHChQwMDLRyk5KUvYh4rpH1HHKRpIow0CWpIgx0SaoIA12SKsJAl6SKMNAlqSIMdEmqiCwC/a6HN/J3P25oGqYkTVhZBPo9q17g9oEN7S5DkjpaFoEuSSpnoEtSRWQT6F4pT5LqyyLQI6LdJUhSx8si0CVJ5Qx0SaqIbAI94SC6JNWTRaA7gi5J5bIIdElSOQNdkioim0B3Hrok1ZdFoDsNXZLKZRHokqRy2QS6Qy6SVF8mge6YiySVySTQJUllDHRJqohsAt0hdEmqL4tAd9qiJJXLItAlSeUMdEmqiGwCPTkRXZLqyiLQHUKXpHJZBLokqZyBLkkVYaBLUkVkEejOQ5ekclkEuiSpXEOBHhG/HxGrI+JnEfGtiJgaEYsiYnlErIuI2yJi8ngXK0kaWWmgR8R84PeA/pTSLwLdwNXAV4AbUkpnAi8D14xnoU5Dl6T6Gh1y6QFOiIgeYBqwGXgncEdx/y3AB5peXSGciS5JpUoDPaW0CfgT4HlqQb4TWAnsSCntL1bbCMwf7vERcW1EDETEwODgYHOqliQdo5Ehl5nAVcAi4FTgROC9jW4gpXRjSqk/pdTf19c36kIlSfU1MuTyLuCZlNJgSul14NvARcBJxRAMwAJg0zjVCEDyjOiSVFcjgf48cGFETIuIAC4DHgceAj5crLMMuHt8SnQeuiQ1opEx9OXUPvz8KfBY8Zgbgc8Cn4mIdcDJwE3jWKckqURP+SqQUvoC8IWjmtcDFzS9IknSqGTzTVHnoUtSfVkEumPoklQui0CXJJUz0CWpIrIJdIfQJam+LALdc7lIUrksAl2SVC6bQE/OW5SkuvIIdEdcJKlUHoEuSSploEtSRWQT6I6gS1J9WQS6Q+iSVC6LQJcklTPQJaki8gl0B9Elqa4sAj08f64klcoi0CVJ5Qx0SaqIbALdIXRJqi+LQHcEXZLKZRHokqRyBrokVUQ2ge750CWpviwC3WnoklQui0CXJJUz0CWpIrIJdEfQJam+LALdIXRJKpdFoEuSyhnoklQR2QS609Alqb4sAt3zoUtSuYYCPSJOiog7ImJNRDwREe+IiFkR8UBErC1+zxzvYiVJI2v0CP3rwD+nlBYDS4EngOuAB1NKZwEPFrclSW1SGugRMQO4BLgJIKW0L6W0A7gKuKVY7RbgA+NTYk1yJrok1dXIEfoiYBD464h4OCK+GREnAnNTSpuLdbYAc4d7cERcGxEDETEwODg4qiIdQZekco0Eeg9wHvAXKaVzgT0cNbySaqdCHPYQOqV0Y0qpP6XU39fXN9Z6JUkjaCTQNwIbU0rLi9t3UAv4rRExD6D4vW18Sqxx2qIk1Vca6CmlLcCGiHhz0XQZ8DhwD7CsaFsG3D0uFYJjLpLUgJ4G1/td4NaImAysBz5G7T+D2yPiGuA54CPjU6IkqRENBXpKaRXQP8xdlzW1GknSqGXxTVFwDF2SymQR6OEguiSVyiLQJUnlDHRJqggDXZIqIotA9+y5klQui0CXJJUz0CWpIrIJ9OREdEmqK4tAdwhdksplEeiSpHIG
uiRVRDaB7gi6JNWXRaA7D12SymUR6JKkcga6JFVENoHuNHRJqi+LQPd86JJULotAlySVM9AlqSKyCfTkTHRJqiuLQHceuiSVyyLQJUnlsgl0py1KUn1ZBLpDLpJULotAlySVM9AlqSKyCXSH0CWpvkwC3UF0SSqTSaBLkspkE+hOW5Sk+rII9NcPHGT7K3vbXYYkdbQsAv3nrx+gd2pPu8uQpI7WcKBHRHdEPBwR/1TcXhQRyyNiXUTcFhGTx6vIOb1TnOYiSSWO5wj9U8ATQ25/BbghpXQm8DJwTTMLG6o7ggMOoktSXQ0FekQsAN4HfLO4HcA7gTuKVW4BPjAO9QHQ3RUcOGigS1I9jR6h/ynwX4GDxe2TgR0ppf3F7Y3A/OEeGBHXRsRARAwMDg6Orsiu4KBH6JJUV2mgR8SvANtSSitHs4GU0o0ppf6UUn9fX99onqI25OIRuiTV1cjUkYuAX42IK4GpwBuArwMnRURPcZS+ANg0XkV2BZjnklRf6RF6Sun6lNKClNJC4Grg+yml/wg8BHy4WG0ZcPe4FdlV++r/QVNdkkY0lnnonwU+ExHrqI2p39Scko7VXZwQ3ZkukjSy4/q2TkrpB8APiuX1wAXNL+lYh47QDxxMTOpuxRYlKT9ZfFN0589fB2Dv/oMla0rSxJVFoJ86YyqAM10kqY4sAr2nu1bm/oMeoUvSSPII9GIMff8Bj9AlaSR5BPqhI3QDXZJGlEegHzpCd8hFkkaURaDvO1AL8hf37GtzJZLUubII9FNnnAD4TVFJqieLQJ86qVbmoSN1SdKxsgj0ScWHovv8YpEkjSiLQO/prn0o+tyLr7a5EknqXFkE+uzpUwCYMimLciWpLbJIyCk9DrlIUpksAn2ygS5JpfII9OJD0Uc27mhvIZLUwbII9ENf/T8020WSdKxsEnLJvDfw1NZX2l2GJHWsbAJ9z779nDjZyxVJ0kiyCfSz5/by6Kad7S5DkjpWNoH+2usHiHYXIUkdLJtAP/f0mezdf9ATdEnSCLIJ9JRqQf7Czp+3uRJJ6kzZBPpb5r0BgMHde9tciSR1pmwCfcYJkwBYs2V3myuRpM6UTaAvPqUXgE0vO+QiScPJJtB7p9aO0Fc8+1KbK5GkzpRNoE/u6WLxKb28+Ipj6JI0nGwCHeCkaZN4enAPe/bub3cpktRxsgr0S87uA+ClPfvaXIkkdZ6sAv3MvukA3PaTDW2uRJI6T1aB/stvrh2hv+KQiyQdI6tAn9LTTV/vFP7mX5/l1X2GuiQNlVWgA1z0ppMBvzEqSUcrDfSIOC0iHoqIxyNidUR8qmifFREPRMTa4vfM8S8XrjhnHgB/8t2nWrE5ScpGI0fo+4E/SCktAS4EPhkRS4DrgAdTSmcBDxa3x92Fb6wdoT+7fU8rNidJ2SgN9JTS5pTST4vl3cATwHzgKuCWYrVbgA+MU41HmHHCJN6/9FQe27STex/b3IpNSlIWjmsMPSIWAucCy4G5KaVDiboFmDvCY66NiIGIGBgcHBxLrYf92rnzAfju6i1NeT5JqoKGAz0ipgN3Ap9OKe0ael+qnax82CtPpJRuTCn1p5T6+/r6xlTsIZcunsPiU3r5zqoXWPGM53aRJGgw0CNiErUwvzWl9O2ieWtEzCvunwdsG58Sh/f+pacCcNfDm1q5WUnqWI3McgngJuCJlNLXhtx1D7CsWF4G3N388kb2yUvPZOHJ01i+/kV+8GRL/y+RpI7UyBH6RcBvAu+MiFXFz5XAl4HLI2It8K7idktddOZsNrz8Kv/noadbvWlJ6jg9ZSuklP4fECPcfVlzyzk+X/rgOWzbvZdHNuzgtp88z0f6T6P2hkKSJp7svil6tCXz3sC23Xv57J2P8cLO19pdjiS1TfaB/vuXn82f/YdzAbhz5UbWbNlV8ghJqqbsAx3gjFknAvC1B57iD25/pM3VSFJ7VCLQz1kwg4HPv4srfvEUtu56jR89NcimHV5MWtLEUolA
B5g9fQoLZ5/I9lf28dGbV/Cfbhlod0mS1FKVCXSAT112Fnd+/Je4fMlctu56jSe37Gb94CvUvsgqSdVWqUCfOqmb88+Yydlzp/Pinn28509/xDu/+kP+8VFP4iWp+krnoefo2kvexDnzZ/Da6wf59G2reGZwDzte3UcQzJg2qd3lSdK4iFYOR/T396eBgdaNbaeUOPvz9/H6gX/v4/VXLOY///KbWlaDJI1VRKxMKfWXrVfJI/RDIoIbP9p/+GIYNzzwFOsHvTCGpGqqdKADXPrmOfDm2vKty5/nH1Zu4DuramdojIAvvv8XuPqC09tYoSQ1R+UDfajPXfkWfvzMi4dv/92/PccjG3dy9QVtLEqSmmRCBfqli+dw6eI5h28/sHord6/axL+s2364racr+J+/ds7ha5dKUi4mVKAf7ROXnnlEmKeU+M6qFxh49iUDXVJ2Kj3LZTTO/tx9zJ4+mQUzpx3R3tUF/+U9izn/jJltqkzSRNXoLJdKfbGoGX7rooWccfKJdHfFET8/Xv8SP/TKSJI62IQechnOf7vyLcO2n/OF+/mnxzazfvvw0x7n9E7l8+97C11dXmBDUnsY6A1631vnseLZl3h887HnW9/92n4Gd+/lYxct5LRZ04Z5tCSNPwO9QV/+0FtHvO/exzbziVt/yg3fe4qZ0ybXfZ63LpjBVW+b3+zyJMlAb4az5/Yye/oUvrt6a9319u4/QO/USQa6pHFhoDfBmXOmM/D5d5Wu9+X71vDN/7uev/rR+oafu6sr+NWlp9LXO2UsJUqaAAz0FjprznT2H0x86d4njutxr+7dz+9edtY4VSWpKgz0FvrQ+Qu44pxTOHgcU//f/kff4+ENO7i7OP/MaJw+axrnnu78eanqDPQWmzb5+P7k82eewPfXbOP7a0Y/B35KTxdr/sd7iXBKpVRlBnqHu+sTv8S23XtH/fjbBzbwjR+u50drtzOlp3nfI+vuCpYuOInJTXxOSWNjoHe43qmT6J06+qssLT6lF4BlN69oVkmH/fdfWcJvX7yo6c8raXQM9Ip7/1tP5bSZ09h34GBTn3fZzStYu+2VwxcPaYepk7o5ZcbUtm1f6jQGesX1dHfRv3BW05931omT+daK5/nWiueb/tzH486Pv4Pzz2h+/6QcGegalZuWvZ2123a3bfvbdu3lf923hme2v8qSeTPaVsfRpk7q8sNntY2nz1WWtu1+jQu+9GC7yzjGh85bwFc/srTdZahivEi0Km1O71Ru+PWlbN01+hlAzXbHyo08tbV971okA13Z+uC5C9pdwhEef2EX//joC5zzxfvbXcqE9JnLz+ZjF03sWVcGutQk11y8iJOn1z/bpsbHXQ9vYuC5lw30sTw4It4LfB3oBr6ZUvpyU6qSMrT0tJNYetpJ7S5jQlr53Mv8YM02Lv/aD9tdyohuWvZ2Tj95fK+XMOpAj4hu4M+By4GNwE8i4p6U0uPNKk6SGnHNxYu4f/WWdpdRVyu+VT2WI/QLgHUppfUAEfH3wFWAgS6ppa5623yvM8DYLhI9H9gw5PbGou0IEXFtRAxExMDg4OAYNidJqmfc3wOklG5MKfWnlPr7+vrGe3OSNGGNJdA3AacNub2gaJMktcFYAv0nwFkRsSgiJgNXA/c0pyxJ0vEa9YeiKaX9EfE7wP3Upi3enFJa3bTKJEnHZUzz0FNK9wL3NqkWSdIYeLkZSaoIA12SKqKlp8+NiEHguVE+fDawvYnl5MA+Twz2ufrG2t8zUkql875bGuhjEREDjZwPuErs88Rgn6uvVf11yEWSKsJAl6SKyCnQb2x3AW1gnycG+1x9LelvNmPokqT6cjpClyTVYaBLUkVkEegR8d6IeDIi1kXEde2u53hFxLMR8VhErIqIgaJtVkQ8EBFri98zi/aIiP9d9PXRiDhvyPMsK9ZfGxHLhrSfXzz/uuKx0YY+3hwR2yLiZ0Paxr2PI22jjX3+YkRsKvb1qoi4csh91xf1PxkR
7xnSPuzruzjx3fKi/bbiJHhExJTi9rri/oUt6u9pEfFQRDweEasj4lNFe2X3c50+d+Z+Til19A+1E389DbwRmAw8Aixpd13H2YdngdlHtf0xcF2xfB3wlWL5SuA+IIALgeVF+yxgffF7ZrE8s7hvRbFuFI+9og19vAQ4D/hZK/s40jba2OcvAn84zLpLitfuFGBR8Zrurvf6Bm4Hri6W/xL4eLH8CeAvi+Wrgdta1N95wHnFci/wVNGvyu7nOn3uyP3c0n/0o/yDvgO4f8jt64Hr213XcfbhWY4N9CeBeUNeNE8Wy98AfuPo9YDfAL4xpP0bRds8YM2Q9iPWa3E/F3JkuI17H0faRhv7PNI/9CNet9TOUvqOkV7fRaBtB3qK9sPrHXpssdxTrBdt2N93U7umcOX38zB97sj9nMOQS0OXuutwCfhuRKyMiGuLtrkppc3F8hZgbrE8Un/rtW8cpr0TtKKPI22jnX6nGGK4ecjQwPH2+WRgR0pp/1HtRzxXcf/OYv2WKd7+nwssZ4Ls56P6DB24n3MI9Cq4OKV0HnAF8MmIuGTonan2X3Cl54+2oo8d8nf8C+BNwNuAzcBX21rNOIiI6cCdwKdTSruG3lfV/TxMnztyP+cQ6Nlf6i6ltKn4vQ24C7gA2BoR8wCK39uK1Ufqb732BcO0d4JW9HGkbbRFSmlrSulASukg8FfU9jUcf59fBE6KiJ6j2o94ruL+GcX64y4iJlELtltTSt8umiu9n4frc6fu5xwCPetL3UXEiRHRe2gZeDfwM2p9OPTp/jJqY3MU7R8tZghcCOws3mreD7w7ImYWb+/eTW2sbTOwKyIuLGYEfHTIc7VbK/o40jba4lDoFD5IbV9Drc6ri5kLi4CzqH0AOOzruzgKfQj4cPH4o/9+h/r8YeD7xfrjqvjb3wQ8kVL62pC7KrufR+pzx+7ndnywMIoPIq6k9uny08Dn2l3Pcdb+RmqfaD8CrD5UP7WxsAeBtcD3gFlFewB/XvT1MaB/yHP9NrCu+PnYkPb+4gX1NPBntOcDsm9Re+v5OrVxwGta0ceRttHGPv9t0adHi3+Q84as/7mi/icZMhNppNd38dpZUfwt/gGYUrRPLW6vK+5/Y4v6ezG1oY5HgVXFz5VV3s91+tyR+9mv/ktSReQw5CJJaoCBLkkVYaBLUkUY6JJUEQa6JFWEgS5JFWGgS1JF/H85cMkmMcaqfgAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
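- "The plot above shows that a small fraction of users read an extremely wide range of categories, while most users stay below 20 news categories."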
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
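- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>user_id</th>\n",
- "      <th>category_id</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>count</th>\n",
- "      <td>250000.000000</td>\n",
- "      <td>250000.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>mean</th>\n",
- "      <td>124999.500000</td>\n",
- "      <td>4.573188</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>std</th>\n",
- "      <td>72168.927986</td>\n",
- "      <td>4.419800</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>min</th>\n",
- "      <td>0.000000</td>\n",
- "      <td>1.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>25%</th>\n",
- "      <td>62499.750000</td>\n",
- "      <td>2.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>50%</th>\n",
- "      <td>124999.500000</td>\n",
- "      <td>3.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>75%</th>\n",
- "      <td>187499.250000</td>\n",
- "      <td>6.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>max</th>\n",
- "      <td>249999.000000</td>\n",
- "      <td>95.000000</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"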
+ "source": [
+ "# Users with the 50 highest click counts\n",
+ "plt.plot(user_click_item_count[:50])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
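+ "The 50 most active users each have more than 100 clicks. Idea: we can define users with at least 100 clicks as active users. This is a simple heuristic; a more complete measure of user activity would also take click time into account, and later we will judge user activity from both click count and click time."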
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXEAAAD4CAYAAAAaT9YAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAARV0lEQVR4nO3dfYxc1X3G8eexd7ExEDAYjEPYrkOQFZekKUxT2lKgJQHHSuWGphJIDaRYWaUBKUitKJQqRWlTNYnaSFWippvaMonASZsUGSVtg4tSXKkYYqd+WQqYlwLxSzAvcYgIBYxP/5i7u6Nl786dmTt7z5n7/UjWzt6Z3fmdnfGjM+ece65DCAIApGlB1QUAALpHiANAwghxAEgYIQ4ACSPEASBhQ/P5ZMuWLQujo6Pz+ZQAkLydO3c+H0I4fbb75jXER0dHtWPHjvl8SgBInu2n8+5jOAUAEkaIA0DCCHEASBghDgAJI8QBIGFtQ9z2RtuHbU/Mct8f2g62l/WnPADAXIr0xDdJWjPzoO2zJV0u6ZmSawIAFNR2nXgIYZvt0Vnu+oKkmyRtKbuome59+Fnt/uGRfj/Nm5yy5Dh99FdHtWCB5/25AaCIrk72sb1O0oEQwm577oCzPSZpTJJGRka6eTrdt+85fW177lr3vpjcZv2SVafrnNNPnNfnBoCiOg5x20sk/YmaQylthRDGJY1LUqPR6OoKFJ9ed54+ve68bn60a9/ec1A33PnfeuMYF80AEK9uVqecI2mlpN22n5L0Nkk/sH1mmYXFggsfAYhZxz3xEMJeSWdMfp8FeSOE8HyJdVXOYhwcQPyKLDHcLOl+Sats77e9vv9lVW9yqD+IrjiAeBVZnXJ1m/tHS6smIpP9cIZTAMSMMzZzTPXECXEAESPE22A4BUDMCPFcTGwCiB8hnoPhFAApIMRz0A8HkAJCvA164gBiRojnaLcnDADEgBDPMbVOnNUpACJGiOdgYhNACgjxHNOn3QNAvAhxAEgYIZ5jchfDwHgKgIgR4nkYTgGQAEI8B7sYAkgBIZ5jep04KQ4gXoQ4ACSMEM/BcAqAFBDiOVgnDiAFhHiO6SWGFRcCAHMgxHOw/xWAFBDibXCyD4CYEeI5WGAIIAWEeB52MQSQAEI8x9TEJn1xABEjxHMwsQkgBYR4O3TEAUSMEM/BxCaAFBDiOSY3wGJiE0DMCPEcjIkDSAEhnoOr3QNIASHeBsMpAGLWNsRtb7R92PZEy7E/t73H9i7b99h+a3/LnH/sYgggBUV64pskrZlx7PMhhHeHEN4j6duSPlVyXRFgUBxA/IbaPSCEsM326IxjL7V8e4IGsMM6tKAZ4us3fV8LIp/lPGHRQm25/iKNnLak6lIAzLO2IZ7H9mckXSPpJ5J+Y47HjUkak6SRkZFun27erX7rW3TTmlX66f8drbqUOR088oq27DqoA0deIcSBGuo6xEMIt0q61fYtkm6Q9Gc5jxuXNC5JjUYjmR778MIF+sSl76i6jLa2P/mCtuw6yCoaoKbKWJ1yh6TfKeH3oAtTAz1kOFBLXYW47XNbvl0n6ZFyykG3yHCgntoOp9jeLOlSScts71dz2GSt7VWSjkl6WtLH+1kk8rE9AFBvRVanXD3L4Q19qAVdiHzhDIA+44zNxLE9AFBvhHjizGXkgFojxAcEGQ7UEyGevMmJTWIcqCNCPHFMbAL1RognjsvIAfVGiCfO7JkL1BohPiBYYgjUEyGeuKnhFDIcqCVCPHFMbAL1RognzmLvFKDOCPHEMa8J1BshPiA42QeoJ0J8QBDhQD0R4oljYhOoN0I8cUxsAvVGiCfOXGQTqDVCfEDQEwfqiRBPHEsMgXojxBNnMbMJ1BkhnjguzwbUGyGeOC6UDNQbIT4g6IkD9USIJ46JTaDeCPHkMbEJ1BkhnrjpiU364kAdEeKJox8O1BshPiDoiAP1RIgnbvJq9ywxBOqJEE8cwylAvbUNcdsb
bR+2PdFy7PO2H7G9x/Zdtk/pa5XIxRmbQL0V6YlvkrRmxrGtks4LIbxb0j5Jt5RcFwpiP3Gg3obaPSCEsM326Ixj97R8u13Sh0uuCx26b99zOvLK61WX0bMVJy/W2netqLoMIBltQ7yA6yR9I+9O22OSxiRpZGSkhKdDq5OXDOvERUO6e/dB3b37YNXllGLvbZfrpMXDVZcBJKGnELd9q6Sjku7Ie0wIYVzSuCQ1Gg0+9Jfs5OOHteNP36dXjx6rupSe3fnAM/rsvz2io2/wNgGK6jrEbX9U0gclXRY4XbBSi4cXavHwwqrL6Nnxw80pGt5MQHFdhbjtNZJuknRJCOFn5ZaEurJZMAl0qsgSw82S7pe0yvZ+2+slfVHSSZK22t5l+8t9rhM1wD4wQOeKrE65epbDG/pQC2pu+gIXAIrijE1Eh444UBwhjniwDwzQMUIc0WBaE+gcIY5omEFxoGOEOKJDhgPFEeKIBpt5AZ0jxBGNqXXi9MWBwghxRIOJTaBzhDiiwQUugM4R4ogOGQ4UR4gjGtMTm8Q4UBQhjngwnAJ0jBBHNJjYBDpHiANAwghxRGPyohAMpwDFEeKIxvTWKaQ4UBQhjmiwThzoHCEOAAkjxBGN6b1TABRFiCManOwDdI4QRzToiQOdI8QRHTriQHGEOAAkjBBHNMxFNoGOEeKIxlSEk+FAYYQ4osHEJtA5QhzRoScOFEeIIxpmM1qgY4Q4osHV7oHOEeKIBhObQOcIcUSDXQyBzrUNcdsbbR+2PdFy7HdtP2T7mO1Gf0tE3TCcAhRXpCe+SdKaGccmJF0paVvZBaHOmNgEOjXU7gEhhG22R2cce1hqPcMO6N2C7O30e//wgIYWDv5I37lnnKg7P3Zh1WUgcW1DvFe2xySNSdLIyEi/nw4J++WVp+m6X1upV15/o+pS+m7vgSP6rydeqLoMDIC+h3gIYVzSuCQ1Gg0GO5Hr5CXD+tRvra66jHnxha37NHHgparLwAAY/M+sQISmV+LQr0FvCHGgQmQ4elVkieFmSfdLWmV7v+31tj9ke7+kX5H0Hdvf7XehwCCZuhRdxXUgfUVWp1ydc9ddJdcC1AYLu1AWhlOACkxvMUBfHL0hxIEKsHc6ykKIAxWiI45eEeJABSbPdmafGPSKEAeAhBHiQAXYdhdlIcSBCnApOpSFEAcqRE8cvSLEgQpwPVGUhRAHKsBgCspCiAMVYGITZSHEgQqwARbKQogDFWLvFPSKEAcqwN4pKAshDgAJI8SBCkztnUJXHD0ixIEKTC0xJMTRI0IcqBAn+6BXhDhQAdaJoyyEOFABzthEWQhxoALTF4UAekOIAxWYHk4hxtEbQhyoEBGOXhHiQAUmx8TpiKNXhDhQBTO1iXIQ4kAFpnriDKigR4Q4UAFPpzjQE0IcqBAZjl4R4kAFpi4KQYqjR4Q4UAHmNVGWtiFue6Ptw7YnWo6danur7ceyr0v7WyYwWJjYRFmK9MQ3SVoz49jNku4NIZwr6d7sewAFsQEWyjLU7gEhhG22R2ccXifp0uz27ZL+Q9Ifl1kYUAf/sveQli45ruoyorRoeIHev3q5Fg0trLqUqLUN8RzLQwiHsts/krQ874G2xySNSdLIyEiXTwcMlmUnLpIk/cV3Hq64krj9/Ucu0BU/f2bVZUSt2xCfEkIItnM/FIYQxiWNS1Kj0eDDIyDpsncu1/ZbLtNrR49VXUqUnn7xZX1kw4N6lb9PW92G+LO2V4QQDtleIelwmUUBdXDmyYurLiFar73RDG92eWyv2yWGd0u6Nrt9raQt5ZQDACzB7ESRJYabJd0vaZXt/bbXS/orSe+3/Zik92XfA0ApyPDiiqxOuTrnrstKrgUAJLVc+YjRlLY4YxNAtDgZqj1CHEB0uGhGcYQ4gOgwsVkcIQ4gOuzyWBwhDiA6U3vLVFtGEghxANHiZJ/2CHEA0SLC2yPEAUSHic3iCHEA0TGD4oUR4gCiw5WPiiPEAUSLec32
CHEA0WE0pThCHEB0zD6GhRHiAKLDhaSLI8QBRIeJzeIIcQDRoifeHiEOID5MbBZGiAOIDhObxRHiAKJjrgpRGCEOIDrTE5tohxAHEC064u0R4gCiM321e1K8HUIcQHSY1iyOEAcQHfZOKY4QBxAdLpRcHCEOIFpkeHuEOID4TG2ARYy3Q4gDiA7X2CyOEAcQHTK8OEIcQHSm14lXXEgCCHEA0WI/8fZ6CnHbn7Q9Yfsh2zeWVBOAmmP/q+K6DnHb50n6mKT3SvoFSR+0/Y6yCgNQX0xsFjfUw8++U9IDIYSfSZLt+yRdKelzZRQGoL4mT/b5yn8+qW/u3F9xNeX4yyvfpV8aPbX039tLiE9I+ozt0yS9ImmtpB0zH2R7TNKYJI2MjPTwdADqYvHwAn38knP0zIsvV11KaY4fXtiX3+teFtPbXi/pE5JelvSQpFdDCDfmPb7RaIQdO96U8wCAOdjeGUJozHZfTxObIYQNIYQLQggXS/qxpH29/D4AQGd6GU6R7TNCCIdtj6g5Hn5hOWUBAIroKcQlfSsbE39d0vUhhCO9lwQAKKqnEA8h/HpZhQAAOscZmwCQMEIcABJGiANAwghxAEhYTyf7dPxk9nOSnu7yx5dJer7EclJAm+uBNtdDL23+uRDC6bPdMa8h3gvbO/LOWBpUtLkeaHM99KvNDKcAQMIIcQBIWEohPl51ARWgzfVAm+uhL21OZkwcAPBmKfXEAQAzEOIAkLAkQtz2GtuP2n7c9s1V19ML20/Z3mt7l+0d2bFTbW+1/Vj2dWl23Lb/Nmv3Htvnt/yea7PHP2b72qraMxvbG20ftj3Rcqy0Ntq+IPsbPp79bOVXZMxp8222D2Sv9S7ba1vuuyWr/1HbV7Qcn/W9bnul7Qey49+wfdz8tW52ts+2/T3b/5NdLP2T2fGBfa3naHN1r3UIIep/khZKekLS2yUdJ2m3pNVV19VDe56StGzGsc9Jujm7fbOkz2a310r6VzUv/n2hmtc0laRTJT2ZfV2a3V5addta2nOxpPMlTfSjjZIezB7r7Gc/EGmbb5P0R7M8dnX2Pl4kaWX2/l4413td0j9Kuiq7/WVJfxBBm1dIOj+7fZKaF4VZPciv9Rxtruy1TqEn/l5Jj4cQngwhvCbp65LWVVxT2dZJuj27fbuk3245/tXQtF3SKbZXSLpC0tYQwoshhB9L2ippzTzXnCuEsE3SizMOl9LG7L63hBC2h+a7/Kstv6syOW3Os07S10MIr4YQ/lfS42q+z2d9r2e9z9+U9M3s51v/fpUJIRwKIfwgu/1TSQ9LOksD/FrP0eY8fX+tUwjxsyT9sOX7/Zr7jxa7IOke2zvdvIi0JC0PIRzKbv9I0vLsdl7bU/yblNXGs7LbM4/H6oZs6GDj5LCCOm/zaZKOhBCOzjgeDdujkn5R0gOqyWs9o81SRa91CiE+aC4KIZwv6QOSrrd9ceudWY9joNd91qGNmb+TdI6k90g6JOmvK62mT2yfKOlbkm4MIbzUet+gvtaztLmy1zqFED8g6eyW79+WHUtSCOFA9vWwpLvU/Fj1bPbRUdnXw9nD89qe4t+krDYeyG7PPB6dEMKzIYQ3QgjHJH1Fzdda6rzNL6g59DA043jlbA+rGWZ3hBD+OTs80K/1bG2u8rVOIcS/L+ncbMb2OElXSbq74pq6YvsE2ydN3pZ0uaQJNdszOSN/raQt2e27JV2TzepfKOkn2cfU70q63PbS7GPb5dmxmJXSxuy+l2xfmI0fXtPyu6IyGWSZD6n5WkvNNl9le5HtlZLOVXMCb9b3etab/Z6kD2c/3/r3q0z2998g6eEQwt+03DWwr3Vemyt9rauc6S36T81Z7X1qzubeWnU9PbTj7WrOQu+W9NBkW9QcB7tX0mOS/l3SqdlxS/pS1u69khotv+s6NSdJHpf0+1W3bUY7N6v5kfJ1Ncf01pfZRkmN7D/JE5K+qOzM4wjb/LWsTXuy/8wrWh5/a1b/o2pZ
cZH3Xs/eOw9mf4t/krQogjZfpOZQyR5Ju7J/awf5tZ6jzZW91px2DwAJS2E4BQCQgxAHgIQR4gCQMEIcABJGiANAwghxAEgYIQ4ACft/AbwTsfQSxAYAAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id category_id\n",
- "count 250000.000000 250000.000000\n",
- "mean 124999.500000 4.573188\n",
- "std 72168.927986 4.419800\n",
- "min 0.000000 1.000000\n",
- "25% 62499.750000 2.000000\n",
- "50% 124999.500000 3.000000\n",
- "75% 187499.250000 6.000000\n",
- "max 249999.000000 95.000000"
- ]
- },
- "execution_count": 49,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
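- "### Distribution of the length of articles users view\n",
- "\n",
- "Computing the average word count of the news each user clicks shows whether a user is more interested in long or short articles."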
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 50,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
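- "The plot above shows that a small fraction of users read articles with a very high average word count, and a small fraction read articles with a very low average word count.\n",
- "\n",
- "Most users prefer news of 200-400 words."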
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Zoom in on the range that covers most users\n",
- "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Most users read articles of fewer than 250 words."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
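- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\">\n",
- "      <th></th>\n",
- "      <th>user_id</th>\n",
- "      <th>words_count</th>\n",
- "    </tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr>\n",
- "      <th>count</th>\n",
- "      <td>250000.000000</td>\n",
- "      <td>250000.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>mean</th>\n",
- "      <td>124999.500000</td>\n",
- "      <td>205.830189</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>std</th>\n",
- "      <td>72168.927986</td>\n",
- "      <td>47.174030</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>min</th>\n",
- "      <td>0.000000</td>\n",
- "      <td>8.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>25%</th>\n",
- "      <td>62499.750000</td>\n",
- "      <td>187.500000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>50%</th>\n",
- "      <td>124999.500000</td>\n",
- "      <td>202.000000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>75%</th>\n",
- "      <td>187499.250000</td>\n",
- "      <td>217.750000</td>\n",
- "    </tr>\n",
- "    <tr>\n",
- "      <th>max</th>\n",
- "      <td>249999.000000</td>\n",
- "      <td>3434.500000</td>\n",
- "    </tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"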
+ "source": [
+ "# Users ranked [25000:50000] by click count\n",
+ "plt.plot(user_click_item_count[25000:50000])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "There are a great many users with two or fewer clicks; these users can be regarded as inactive."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News click count analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:42:14.526476Z",
+ "start_time": "2020-11-13T15:42:14.463642Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:42:16.198000Z",
+ "start_time": "2020-11-13T15:42:16.044455Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAD4CAYAAAAO9oqkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAActElEQVR4nO3de5Bc5X3m8e8zNwmEQBIaFFaSI9kWTgTr2HgCStnrjZEDgrgi/iAuUdlF66isqlhOHG92bbCrQtY2teBkF5uKjVcxCsJxIRSMFyWBEC0mIZs1guEugUGDuGhkgQZGiLsuM7/947w9c6Z7Lk1fZkY5z6dqqk+/59K/PmrNM+95z+mjiMDMzIqtZaoLMDOzqecwMDMzh4GZmTkMzMwMh4GZmQFtU11ArebPnx9LliyZ6jLMzI4rDz744MsR0VneftyGwZIlS+ju7p7qMszMjiuSnh+t3YeJzMzMYWBmZg4DMzPDYWBmZlQRBpI2STogaWdZ++9L+pmkXZK+mWu/QlKPpKckXZBrX5XaeiRdnmtfKmlHar9FUkej3pyZmVWnmp7BjcCqfIOkTwCrgV+JiDOBP0vty4E1wJlpne9KapXUCnwHuBBYDlyalgW4Brg2It4PHATW1fumzMzs3ZkwDCLiXqC/rPn3gKsj4nBa5kBqXw1siYjDEfEs0AOck356ImJPRBwBtgCrJQk4D7g1rb8ZuLi+t2RmZu9WrWMGZwD/Lh3e+SdJv5raFwJ7c8v1prax2k8FXo2IY2Xto5K0XlK3pO6+vr6aCr/xX57lbx79eU3rmpn9a1VrGLQB84AVwH8Ftqa/8psqIjZGRFdEdHV2VlxAV5W/2vECd+7c3+DKzMyOb7VegdwL3BbZnXHulzQIzAf2AYtzyy1KbYzR/gowR1Jb6h3kl28KAb6fj5nZSLX2DP438AkASWcAHcDLwDZgjaQZkpYCy4D7gQeAZenMoQ6yQeZtKUzuAS5J210L3F5jTVWRHAZmZuUm7BlIuhn4dWC+pF7gSmATsCmdbnoEWJt+se+StBV4AjgGbIiIgbSdzwN3Aa3ApojYlV7iy8AWSd8AHgZuaOD7q3w/NP1olpnZcWfCMIiIS8eY9R/GWP4q4KpR2u8A7hilfQ/Z2UaTJnDXwMwsr3BXIPswkZlZpcKFAeB+gZlZmcKFgST3DMzMyhQvDAD3DczMRipeGHjMwMysQjHDYKqLMDObZooXBr7OwMysQuHCACB8nMjMbITChYEPE5mZVSpeGOABZDOzcoULAyT3DMzMyhQuDLKegePAzCyveGHgk4nMzCoULwzwmIGZWbnihYG7BmZmFQoXBuD7GZiZlZswDCRtknQg3dWsfN4fSQpJ89NzSbpOUo+kxySdnVt2raTd6Wdtrv0jkh5P61ynJv/p7sNEZmaVqukZ3AisKm+UtBg4H3gh13wh2X2PlwHrgevTsvPIbpd5Ltldza6UNDetcz3w2dx6Fa/VSP6iOjOzShOGQUTcC/SPMuta4EuMvKB3NXBTZO4D5kg6HbgA2B4R/RFxENgOrErzTo6I+9I9lG8CLq7rHU1AyIeJzMzK1DRmIGk1sC8iHi2btRDYm3vem9rGa+8dpX2s110vqVtSd19fXy2lg3sGZmYV3nUYSDoR+Arwx40vZ3wRsTEiuiKiq7Ozs6ZtCH83kZlZuVp6Bu8DlgKPSnoOWAQ8JOkXgH3A4tyyi1LbeO2LRmlvGjkNzMwqvOswiIjHI+K0iFgSEUvIDu2cHREvAtuAy9JZRSuAQxGxH7gLOF/S3DRwfD5wV5r3mqQV6Syiy4DbG/TeRuX7GZiZVarm1NKbgZ8CH5DUK2ndOIvfAewBeoC/AD4HEBH9wNeBB9LP11IbaZnvp3WeAe6s7a1UzwPIZmYjtU20QERcOsH8JbnpADaMsdwmYNMo7d3AWRPV0Sg+tdTMrFLhrkD2zW3MzCoVLwyQv8LazKxM8cLAPQMzswqFCwPw
mIGZWbnChYF820szswrFC4OpLsDMbBoqXBgAPk5kZlamcGHgAWQzs0rFCwPcMTAzK1e8MJDvZ2BmVq54YYB7BmZm5YoXBv5uIjOzCoULA/B1BmZm5QoXBlnPwHFgZpZXvDCY6gLMzKahwoWBmZlVquZOZ5skHZC0M9f2p5J+JukxST+WNCc37wpJPZKeknRBrn1VauuRdHmufamkHan9FkkdDXx/o7wfDyCbmZWrpmdwI7CqrG07cFZEfBB4GrgCQNJyYA1wZlrnu5JaJbUC3wEuBJYDl6ZlAa4Bro2I9wMHgfFuq1k34esMzMzKTRgGEXEv0F/W9g8RcSw9vQ9YlKZXA1si4nBEPEt2X+Nz0k9PROyJiCPAFmC1JAHnAbem9TcDF9f3lsbnnoGZWaVGjBn8LsM3sV8I7M3N601tY7WfCryaC5ZS+6gkrZfULam7r6+vpmL93URmZpXqCgNJXwWOAT9sTDnji4iNEdEVEV2dnZ01bcO3vTQzq9RW64qS/hPwKWBlDP923Qcszi22KLUxRvsrwBxJbal3kF++OdwzMDOrUFPPQNIq4EvAb0XEW7lZ24A1kmZIWgosA+4HHgCWpTOHOsgGmbelELkHuCStvxa4vba3UmXtzdy4mdlxqppTS28Gfgp8QFKvpHXAnwOzge2SHpH0PYCI2AVsBZ4A/h7YEBED6a/+zwN3AU8CW9OyAF8G/rOkHrIxhBsa+g5H466BmdkIEx4miohLR2ke8xd2RFwFXDVK+x3AHaO07yE722hS+B7IZmaVCncFcvYV1o4DM7O84oWBB5DNzCoULwzwRWdmZuWKFwa+7aWZWYXihQHuGZiZlStcGPhCAzOzSsULA9wzMDMrV7gwkLsGZmYVihcGvgeymVmF4oUBvs7AzKxc8cLAN7cxM6tQvDDwbS/NzCoULwzcMzAzq1DIMDAzs5EKFwbgAWQzs3IFDAP5MJGZWZlq7nS2SdIBSTtzbfMkbZe0Oz3OTe2SdJ2kHkmPSTo7t87atPxuSWtz7R+R9Hha5zqpuQdysq07DczM8qrpGdwIrCpruxy4OyKWAXen5wAXkt33eBmwHrgesvAArgTOJbur2ZWlAEnLfDa3XvlrNZS/qM7MrNKEYRAR9wL9Zc2rgc1pejNwca79psjcB8yRdDpwAbA9Ivoj4iCwHViV5p0cEfdFdlnwTbltNYVvbmNmVqnWMYMFEbE/Tb8ILEjTC4G9ueV6U9t47b2jtI9K0npJ3ZK6+/r6aipcyF9HYWZWpu4B5PQX/aT8do2IjRHRFRFdnZ2dNW3DPQMzs0q1hsFL6RAP6fFAat8HLM4ttyi1jde+aJT2pvFlBmZmlWoNg21A6YygtcDtufbL0llFK4BD6XDSXcD5kuamgePzgbvSvNckrUhnEV2W21bT+CiRmdlIbRMtIOlm4NeB+ZJ6yc4KuhrYKmkd8Dzw6bT4HcBFQA/wFvAZgIjol/R14IG03NciojQo/TmyM5ZOAO5MP00jeczAzKzchGEQEZeOMWvlKMsGsGGM7WwCNo3S3g2cNVEdjeQoMDMbqXBXIMs3NDAzq1C8MEDOAjOzMsULA9/20sysQvHCAB8lMjMrV7ww8IUGZmYVChcG4OsMzMzKFS4MJN8D2cysXPHCAPcMzMzKFS4M8BfVmZlVKFwYyGlgZlaheGEgPGZgZlameGGAxwzMzMoVLwx8nYGZWYXChQF4yMDMrFzhwsD3QDYzq1S8MPDJRGZmFeoKA0lflLRL0k5JN0uaKWmppB2SeiTdIqkjLTsjPe9J85fktnNFan9K0gV1vqfxa8YDyGZm5WoOA0kLgT8AuiLiLKAVWANcA1wbEe8HDgLr0irrgIOp/dq0HJKWp/XOBFYB35XUWmtdVRTetE2bmR2v6j1M1AacIKkNOBHYD5wH3JrmbwYuTtOr03PS/JWSlNq3RMThiHiW7P7J59RZ15hKUeBxAzOzYTWHQUTsA/4MeIEsBA4BDwKvRsSxtFgv
sDBNLwT2pnWPpeVPzbePss4IktZL6pbU3dfXV1PdpY6Bs8DMbFg9h4nmkv1VvxT4N8AsssM8TRMRGyOiKyK6Ojs7a9qG8GEiM7Ny9Rwm+iTwbET0RcRR4Dbgo8CcdNgIYBGwL03vAxYDpPmnAK/k20dZp2ncMTAzG1ZPGLwArJB0Yjr2vxJ4ArgHuCQtsxa4PU1vS89J838S2YH7bcCadLbRUmAZcH8ddY1r+DCR48DMrKRt4kVGFxE7JN0KPAQcAx4GNgJ/B2yR9I3UdkNa5QbgB5J6gH6yM4iIiF2StpIFyTFgQ0QM1FrXRIYGkJv1AmZmx6GawwAgIq4Erixr3sMoZwNFxDvAb4+xnauAq+qppVoeQDYzq1TAK5CzNPDXWJuZDStcGJS4Z2BmNqxwYeALkM3MKhUvDHydgZlZhcKFQYkPE5mZDStcGAydTeQBZDOzIcULg/TonoGZ2bDihcFQz8DMzEqKFwapb+CvozAzG1a8MHDPwMysQuHCoMQdAzOzYYULA/mqMzOzCoULgyHuGZiZDSlcGAx/hbXTwMyspHhh4K+wNjOrUFcYSJoj6VZJP5P0pKRfkzRP0nZJu9Pj3LSsJF0nqUfSY5LOzm1nbVp+t6S1Y79i/XxzGzOzSvX2DL4N/H1E/BLwK8CTwOXA3RGxDLg7PQe4kOyWlsuA9cD1AJLmkd0g51yym+JcWQqQZhi6n4G7BmZmQ2oOA0mnAB8n3dYyIo5ExKvAamBzWmwzcHGaXg3cFJn7gDmSTgcuALZHRH9EHAS2A6tqrWviurNHR4GZ2bB6egZLgT7gLyU9LOn7kmYBCyJif1rmRWBBml4I7M2t35vaxmqvIGm9pG5J3X19fTUV7e8mMjOrVE8YtAFnA9dHxIeBNxk+JARAZMdiGvZrNyI2RkRXRHR1dnbWthHf9tLMrEI9YdAL9EbEjvT8VrJweCkd/iE9Hkjz9wGLc+svSm1jtZuZ2SSpOQwi4kVgr6QPpKaVwBPANqB0RtBa4PY0vQ24LJ1VtAI4lA4n3QWcL2luGjg+P7U1xdD1x+4YmJkNaatz/d8HfiipA9gDfIYsYLZKWgc8D3w6LXsHcBHQA7yVliUi+iV9HXggLfe1iOivs64xeQDZzKxSXWEQEY8AXaPMWjnKsgFsGGM7m4BN9dRSrZaUBoMeQTYzG1K4K5BbfAWymVmFwoVB6aKzgUGngZlZSeHCoHXoCuQpLsTMbBopXBi0pHc84DQwMxtSvDDwALKZWYXihoHHDMzMhhQuDFrT6UQ+TGRmNqxwYdDis4nMzCoUMAyyR3cMzMyGFS4Mhg4TuWdgZjakcGHQ4jEDM7MKhQuDVt/20sysQuHCYHgAeYoLMTObRooXBqUrkD1mYGY2pHBh4MNEZmaVChcGHkA2M6tUdxhIapX0sKS/Tc+XStohqUfSLekuaEiakZ73pPlLctu4IrU/JemCemsajy86MzOr1IiewReAJ3PPrwGujYj3AweBdal9HXAwtV+blkPScmANcCawCviupNYG1DWq0nUG7hiYmQ2rKwwkLQJ+E/h+ei7gPODWtMhm4OI0vTo9J81fmZZfDWyJiMMR8SzZPZLPqaeu8ZSuQHbPwMxsWL09g28BXwJKJ2qeCrwaEcfS815gYZpeCOwFSPMPpeWH2kdZZwRJ6yV1S+ru6+urqeChw0TuGpiZDak5DCR9CjgQEQ82sJ5xRcTGiOiKiK7Ozs6atjF8mMhhYGZW0lbHuh8FfkvSRcBM4GTg28AcSW3pr/9FwL60/D5gMdArqQ04BXgl116SX6fhfNGZmVmlmnsGEXFFRCyKiCVkA8A/iYjfAe4BLkmLrQVuT9Pb0nPS/J9E9uf5NmBNOttoKbAMuL/WuibS6ttemplVqKdnMJYvA1skfQN4GLghtd8A/EBSD9BPFiBExC5JW4EngGPAhogYaEJdALSnNDh6zF0DM7OShoRBRPwj8I9peg+jnA0UEe8Avz3G+lcBVzWilol0
tGVhcMTHiczMhhTuCuSO1DM44p6BmdmQwoXBjPbsejaHgZnZsMKFQalncPhY04YlzMyOO4ULg/ZWIblnYGaWV7gwkER7awtHBnxqqZlZSeHCALJDRe4ZmJkNK2QYtLeKoz611MxsSCHDoKOtxWFgZpZTyDDIxgwcBmZmJYUMA48ZmJmNVMgwaG/1YSIzs7xChkE2ZuBTS83MSgoZBu2t8mEiM7OcQobBzPZW3jnqr6MwMyspZBic2NHGm0ccBmZmJYUMg1kzWnnryLGpLsPMbNqoOQwkLZZ0j6QnJO2S9IXUPk/Sdkm70+Pc1C5J10nqkfSYpLNz21qblt8tae1Yr9koJ3a08eZh9wzMzErq6RkcA/4oIpYDK4ANkpYDlwN3R8Qy4O70HOBCsvsbLwPWA9dDFh7AlcC5ZHdIu7IUIM0yq8M9AzOzvJrDICL2R8RDafp14ElgIbAa2JwW2wxcnKZXAzdF5j5gjqTTgQuA7RHRHxEHge3AqlrrqsaJM9p468gAg4M+vdTMDBo0ZiBpCfBhYAewICL2p1kvAgvS9EJgb2613tQ2Vvtor7NeUrek7r6+vprrPeWEdgAOvX205m2Ymf1rUncYSDoJ+BHwhxHxWn5eRATQsD+/I2JjRHRFRFdnZ2fN2+mcPQOAvjcON6o0M7PjWl1hIKmdLAh+GBG3peaX0uEf0uOB1L4PWJxbfVFqG6u9aead2AFA/5tHmvkyZmbHjXrOJhJwA/BkRPzP3KxtQOmMoLXA7bn2y9JZRSuAQ+lw0l3A+ZLmpoHj81Nb05x2ctYzeOm1d5r5MmZmx422Otb9KPAfgcclPZLavgJcDWyVtA54Hvh0mncHcBHQA7wFfAYgIvolfR14IC33tYjor6OuCZ1+ykwA9h9yGJiZQR1hEBH/F9AYs1eOsnwAG8bY1iZgU621vFuzZ7Yze0YbP3/17cl6STOzaa2QVyADLJk/i2dffnOqyzAzmxYKGwbv7ZzFnj6HgZkZFDgM3td5EvtefZu3/YV1ZmbFDgOAPS+/McWVmJlNvcKGwS+eeiIAL7zy1hRXYmY29QofBs85DMzMihsGs2e2c+qsDl7o9yCymVlhwwCy3sFzL7tnYGZW6DBYcuosDyCbmVHwMPjgolN46bXDPNPnQDCzYit0GFz0b0+nRfCjB3unuhQzsylV6DA47eSZ/PszOrntoX0M+K5nZlZghQ4DgEs+spgXX3uHf+l5eapLMTObMoUPg5W/fBqnzurg6jt/xjtH/dUUZlZMhQ+Dme2tfPOSD/LE/tf4yo8f5+jA4FSXZGY26QofBgArf3kBX1i5jNse2sen/9dP2f7ESw4FMyuUeu501lCSVgHfBlqB70fE1ZP5+l/8jTNYOn8W//3OJ/nsTd3MObGdT3zgNM5dOo8zfmE2y047idkz2yezJDOzSaPsBmRTXITUCjwN/AbQS3YLzEsj4omx1unq6oru7u6G13J0YJB/eqqPv33s5/zz7pd55c0jQ/Nmz2zjtNkzmH/SDObN6uDkme2cfEIbs2e2c0J7KzM7WjmhvZUZbS10tLUMPba1tNDWKtpLj60ttLaIVomWFoamJQ1Nt7WKFgmle8mVppXtL1qUPZqZvRuSHoyIrvL26dIzOAfoiYg9AJK2AKuBMcOgWdpbW/jk8gV8cvkCBgeDF/rf4umXXueZvjd58dDbHHj9MK+8cYSeA29w6O2jvP7OMd6e4oHnUjBkQTFyukVKP9Vtq5qAqTaDqlmsqter7uWqrGuy31+122pMXdXvq8b9IVFVXVXVXl1Nk70fGvSxqnqxamr6uz/4GDPaWqt70SpNlzBYCOzNPe8Fzi1fSNJ6YD3Ae97znqYX1dIilsyfxZL5s8Zd7ujAIO8cHeDtowO8fWSAI8cGOZx+jg4McmwgODqYPR4bGOTIwCCDEQwMwmAEg4PBQHocDBgYDI4ODBJk80udt8HBIICI1E72JL9caZo0HbnXaYRqepLVvlI1JUWVW6tu
W43ZTrVbq3ZbjdoPVb9eNctUva3G7PjqPzOT/fmrZjtVfkYbtlD1wfluTJcwqEpEbAQ2QnaYaIrLGdLe2kJ7a4vHFMzsuDVdzibaByzOPV+U2szMbBJMlzB4AFgmaamkDmANsG2KazIzK4xpcZgoIo5J+jxwF9mppZsiYtcUl2VmVhjTIgwAIuIO4I6prsPMrIimy2EiMzObQg4DMzNzGJiZmcPAzMyYJt9NVAtJfcDzNa4+Hzhe7mbjWpvDtTaHa22ORtb6ixHRWd543IZBPSR1j/ZFTdORa20O19ocrrU5JqNWHyYyMzOHgZmZFTcMNk51Ae+Ca20O19ocrrU5ml5rIccMzMxspKL2DMzMLMdhYGZmxQoDSaskPSWpR9LlU1jHc5Iel/SIpO7UNk/Sdkm70+Pc1C5J16WaH5N0dm47a9PyuyWtbVBtmyQdkLQz19aw2iR9JL33nrRuzbdsGqPWP5G0L+3bRyRdlJt3RXrdpyRdkGsf9XORvlJ9R2q/JX29eq21LpZ0j6QnJO2S9IXUPu327Ti1Trt9K2mmpPslPZpq/W/jbV/SjPS8J81fUut7aGCtN0p6NrdfP5TaJ/czEBGF+CH7auxngPcCHcCjwPIpquU5YH5Z2zeBy9P05cA1afoi4E6y26euAHak9nnAnvQ4N03PbUBtHwfOBnY2ozbg/rSs0roXNrjWPwH+yyjLLk//5jOApemz0Dre5wLYCqxJ098Dfq+OWk8Hzk7Ts4GnU03Tbt+OU+u027fpvZ6UptuBHWkfjLp94HPA99L0GuCWWt9DA2u9EbhklOUn9TNQpJ7BOUBPROyJiCPAFmD1FNeUtxrYnKY3Axfn2m+KzH3AHEmnAxcA2yOiPyIOAtuBVfUWERH3Av3NqC3NOzki7ovsk3tTbluNqnUsq4EtEXE4Ip4Fesg+E6N+LtJfVOcBt47yvmupdX9EPJSmXweeJLv397Tbt+PUOpYp27dp/7yRnrannxhn+/n9fSuwMtXzrt5Dg2sdy6R+BooUBguBvbnnvYz/AW+mAP5B0oOS1qe2BRGxP02/CCxI02PVPZnvp1G1LUzT5e2N9vnUrd5UOuxSQ62nAq9GxLFG15oOTXyY7C/Dab1vy2qFabhvJbVKegQ4QPaL8Zlxtj9UU5p/KNUzKf/PymuNiNJ+vSrt12slzSivtcqa6voMFCkMppOPRcTZwIXABkkfz89MqT4tz/mdzrUl1wPvAz4E7Af+x5RWU0bSScCPgD+MiNfy86bbvh2l1mm5byNiICI+RHbv9HOAX5raisZWXquks4AryGr+VbJDP1+eitqKFAb7gMW554tS26SLiH3p8QDwY7IP8Eupm0d6PJAWH6vuyXw/japtX5puWs0R8VL6DzcI/AXZvq2l1lfIuuVtZe01k9RO9sv1hxFxW2qelvt2tFqn875N9b0K3AP82jjbH6opzT8l1TOp/89yta5Kh+UiIg4Df0nt+7W+z0C1gwvH+w/ZLT73kA0OlQaCzpyCOmYBs3PT/4/sWP+fMnIg8Ztp+jcZOYh0fwwPIj1LNoA0N03Pa1CNSxg5KNuw2qgc4LqowbWenpv+ItlxYIAzGTlAuIdscHDMzwXw14wchPxcHXWK7Bjut8rap92+HafWabdvgU5gTpo+Afhn4FNjbR/YwMgB5K21vocG1np6br9/C7h6Kj4Dk/qLcKp/yEbnnyY7pvjVKarhvekD9Siwq1QH2XHLu4HdwP/J/eMK+E6q+XGgK7et3yUb6OoBPtOg+m4mOwRwlOyY47pG1gZ0ATvTOn9Ougq+gbX+INXyGLCNkb/Avppe9ylyZ1mM9blI/1b3p/fw18CMOmr9GNkhoMeAR9LPRdNx345T67Tbt8AHgYdTTTuBPx5v+8DM9LwnzX9vre+hgbX+JO3XncBfMXzG0aR+Bvx1FGZmVqgxAzMzG4PDwMzMHAZmZuYwMDMzHAZmZobDwMzMcBiYmRnw/wG9froz3Iu5aAAAAABJ
RU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id words_count\n",
- "count 250000.000000 250000.000000\n",
- "mean 124999.500000 205.830189\n",
- "std 72168.927986 47.174030\n",
- "min 0.000000 8.000000\n",
- "25% 62499.750000 187.500000\n",
- "50% 124999.500000 202.000000\n",
- "75% 187499.250000 217.750000\n",
- "max 249999.000000 3434.500000"
- ]
- },
- "execution_count": 52,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# More detailed statistics\n",
- "user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Time analysis of user clicks"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Normalize the timestamps for easier visualization\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "mm = MinMaxScaler()\n",
- "user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])\n",
- "user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])\n",
- "\n",
- "user_click_merge = user_click_merge.sort_values('click_timestamp')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 54,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>18</th><td>249990</td><td>162300</td><td>0.000000</td><td>4</td><td>3</td><td>20</td><td>1</td><td>25</td><td>2</td><td>5</td><td>5</td><td>281</td><td>0.989186</td><td>193</td></tr>\n",
- "    <tr><th>2</th><td>249998</td><td>160974</td><td>0.000002</td><td>4</td><td>1</td><td>12</td><td>1</td><td>13</td><td>2</td><td>5</td><td>5</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
- "    <tr><th>30</th><td>249985</td><td>160974</td><td>0.000003</td><td>4</td><td>1</td><td>17</td><td>1</td><td>8</td><td>2</td><td>8</td><td>8</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
- "    <tr><th>50</th><td>249979</td><td>162300</td><td>0.000004</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>2</td><td>2</td><td>2</td><td>281</td><td>0.989186</td><td>193</td></tr>\n",
- "    <tr><th>25</th><td>249988</td><td>160974</td><td>0.000004</td><td>4</td><td>1</td><td>17</td><td>1</td><td>21</td><td>2</td><td>17</td><td>17</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
+ "source": [
+ "plt.plot(item_click_count)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 38,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "18 249990 162300 0.000000 4 \n",
- "2 249998 160974 0.000002 4 \n",
- "30 249985 160974 0.000003 4 \n",
- "50 249979 162300 0.000004 4 \n",
- "25 249988 160974 0.000004 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "18 3 20 1 25 \n",
- "2 1 12 1 13 \n",
- "30 1 17 1 8 \n",
- "50 1 17 1 25 \n",
- "25 1 17 1 21 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "18 2 5 5 281 0.989186 \n",
- "2 2 5 5 281 0.989092 \n",
- "30 2 8 8 281 0.989092 \n",
- "50 2 2 2 281 0.989186 \n",
- "25 2 17 17 281 0.989092 \n",
- "\n",
- " words_count \n",
- "18 193 \n",
- "2 259 \n",
- "30 259 \n",
- "50 193 \n",
- "25 259 "
- ]
- },
- "execution_count": 54,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_merge.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 55,
- "metadata": {},
- "outputs": [],
- "source": [
- "def mean_diff_time_func(df, col):\n",
- " df = pd.DataFrame(df, columns={col})\n",
- " df['time_shift1'] = df[col].shift(1).fillna(0)\n",
- " df['diff_time'] = abs(df[col] - df['time_shift1'])\n",
- " return df['diff_time'].mean()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Mean time gap between consecutive clicks\n",
- "mean_diff_click_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 57,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(mean_diff_click_time.values, reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The plot above shows that the time gap between clicks differs across users."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Mean gap between the creation times of consecutively clicked articles\n",
- "mean_diff_created_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 59,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(mean_diff_created_time.values, reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The plot shows that the creation times of the articles a user clicks in succession also differ."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Defaulting to user installation because normal site-packages is not writeable\n",
- "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
- "Collecting gensim\n",
- " Downloading https://mirrors.aliyun.com/pypi/packages/2b/e0/fa6326251692056dc880a64eb22117e03269906ba55a6864864d24ec8b4e/gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)\n",
- "\u001b[K |████████████████████████████████| 24.2 MB 91.0 MB/s eta 0:00:01\n",
- "\u001b[?25hRequirement already satisfied: six>=1.5.0 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.15.0)\n",
- "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
- "Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.5.4)\n",
- "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
- "Collecting smart-open>=1.8.1\n",
- " Downloading https://mirrors.aliyun.com/pypi/packages/e3/cf/6311dfb0aff3e295d63930dea72e3029800242cdfe0790478e33eccee2ab/smart_open-4.0.1.tar.gz (117 kB)\n",
- "\u001b[K |████████████████████████████████| 117 kB 96.7 MB/s eta 0:00:01\n",
- "\u001b[?25hBuilding wheels for collected packages: smart-open\n",
- " Building wheel for smart-open (setup.py) ... \u001b[?25ldone\n",
- "\u001b[?25h Created wheel for smart-open: filename=smart_open-4.0.1-py3-none-any.whl size=108249 sha256=50eb67320a58790e8b173971aeb6af7b636d48259d7c9de759612e58e334215b\n",
- " Stored in directory: /home/admin/.cache/pip/wheels/c3/14/fc/a0e523e5d2f13d083ce0af09d4e2861d8e2ec65fc466fb1dff\n",
- "Successfully built smart-open\n",
- "Installing collected packages: smart-open, gensim\n",
- "Successfully installed gensim-3.8.3 smart-open-4.0.1\n"
- ]
- }
- ],
- "source": [
- "# Install gensim\n",
- "!pip install gensim"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [],
- "source": [
- "from gensim.models import Word2Vec\n",
- "import logging, pickle\n",
- "\n",
- "# Note: the model is trained for only a few epochs here (see iter below)\n",
- "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " # Word2Vec expects string tokens, so cast the article ids to str\n",
- " click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
- " # Build one sentence (click sequence) per user\n",
- " docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
- " docs = docs['click_article_id'].values.tolist()\n",
- "\n",
- " # Configure logging to monitor training progress\n",
- " logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
- "\n",
- " # These parameters strongly affect the learned vectors; negative sampling defaults to 5\n",
- " w2v = Word2Vec(docs, size=16, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=10)\n",
- " \n",
- " # Store the embeddings as a dict\n",
- " item_w2v_emb_dict = {k: w2v[k] for k in click_df['click_article_id']}\n",
- " \n",
- " return item_w2v_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [],
- "source": [
- "item_w2v_emb_dict = trian_item_word2vec(user_click_merge)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "<div>\n",
- "<table border=\"1\" class=\"dataframe\">\n",
- "  <thead>\n",
- "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th></tr>\n",
- "  </thead>\n",
- "  <tbody>\n",
- "    <tr><th>25667</th><td>190841</td><td>199197</td><td>1507045276129</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
- "    <tr><th>25668</th><td>190841</td><td>285298</td><td>1507045302920</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
- "    <tr><th>25669</th><td>190841</td><td>156624</td><td>1507046638885</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
- "    <tr><th>25670</th><td>190841</td><td>129029</td><td>1507046668885</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
- "    <tr><th>107739</th><td>164226</td><td>214800</td><td>1507131402464</td><td>4</td><td>1</td><td>17</td><td>1</td><td>21</td><td>2</td></tr>\n",
- "  </tbody>\n",
- "</table>\n",
- "</div>"
+ "source": [
+ "plt.plot(item_click_count[:100])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The 100 most-clicked articles each received more than 1,000 clicks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(item_click_count[:20])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
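+ "The 20 most-clicked articles each received more than 2,500 clicks. Idea: these can be labeled as hot articles. This is a simple heuristic; later we also grade article popularity by click count and time."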
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(item_click_count[3500:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Many articles were clicked only once or twice. Idea: these can be labeled as cold (unpopular) articles."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Article co-occurrence frequency: how often two articles are clicked consecutively"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>count</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>count</th><td>433597.000000</td></tr>\n",
+ "    <tr><th>mean</th><td>3.184139</td></tr>\n",
+ "    <tr><th>std</th><td>18.851753</td></tr>\n",
+ "    <tr><th>min</th><td>1.000000</td></tr>\n",
+ "    <tr><th>25%</th><td>1.000000</td></tr>\n",
+ "    <tr><th>50%</th><td>1.000000</td></tr>\n",
+ "    <tr><th>75%</th><td>2.000000</td></tr>\n",
+ "    <tr><th>max</th><td>2202.000000</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ " count\n",
+ "count 433597.000000\n",
+ "mean 3.184139\n",
+ "std 18.851753\n",
+ "min 1.000000\n",
+ "25% 1.000000\n",
+ "50% 1.000000\n",
+ "75% 2.000000\n",
+ "max 2202.000000"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tmp = user_click_merge.sort_values('click_timestamp')\n",
+ "tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x:x.shift(-1))\n",
+ "union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].agg({'count'}).reset_index().sort_values('count', ascending=False)\n",
+ "union_item[['count']].describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
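+ "The statistics show a mean co-occurrence count of 3.18 and a maximum of 2,202.\n",
+ "\n",
+ "Although the median is only 1, the heavy tail suggests that articles a user reads in succession are fairly strongly related."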
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Plot it for a visual check\n",
+ "x = union_item['click_article_id']\n",
+ "y = union_item['count']\n",
+ "plt.scatter(x, y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD4CAYAAADvsV2wAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAATdElEQVR4nO3df6xkZX3H8fe37Aq2EPmxN7pd9nKhmhgxuOB1hUANISHlV+CPYrqkRUTNNoopVlsrmiCamIhNlSpG3ApF1Cr4syuFWFqwahuW7OKy/BK9KgYQ3AVkkarU1W//mLMwd5hzZ+7MmTt3znm/ksmeOeeZOd89s/dzn32ec85EZiJJqr/fG3cBkqSlYeBLUkMY+JLUEAa+JDWEgS9JDbFiXDtetWpVzszMjGv3kjSRtm3b9mhmTg3y2rEF/szMDFu3bh3X7iVpIkXETwZ9rUM6ktQQBr4kNYSBL0kNYeBLUkMY+JLUEH0HfkTsExHfjYjru2zbNyKujYi5iNgSETOVVilJGtpievgXAveWbHsj8PPMfDHwEeDSYQuTJFWrr/PwI+JQ4HTgA8DbuzQ5C7ikWP4ScHlERI7g3sv3PfIL/m3HT0u3n/CSKdYffnDVu5WkidfvhVeXAe8EDijZvgZ4ACAz90TEbuAQ4NH2RhGxEdgIMD09PUC5MLfzKT52y1zXbZlw648f57q/PG6g95akOusZ+BFxBrAzM7dFxInD7CwzNwGbAGZnZwfq/Z9+1GpOP+r0rtv+/FO38vRvfjd4gZJUY/2M4R8PnBkR9wNfAE6KiM92tHkIWAsQESuAFwCPVVhn3/z+LknqrmfgZ+ZFmXloZs4AG4CbM/MvOpptBs4rls8u2pi9krSMDHzztIh4P7A1MzcDVwKfiYg54HFavxiWXBD4e0aSultU4GfmN4FvFssXt63/NfDaKguTJFWrVlfaRoy7AklavmoV+OCkrSSVqV3gS5K6q13gO2crSd3VLvAlSd3VKvAjwjF8SSpRq8CXJJWrVeB7VqYklatV4APO2kpSiVoFvhdeSVK5WgU+eOGVJJWpXeBLkrqrVeAHDuFLUplaBb4kqVytAj+ctZWkUrUKfIB02laSuqpV4Nu/l6RytQp8cNJWksrULvAlSd3VKvAj7OFLUplaBb4kqVzNAt9pW0kqU7PA9146klSmVoHvdVeSVK5n4EfEfhFxW0TcERF3R8T7urR5fUTsiojtxeNNoym3t3TWVpK6WtFHm6eBkzLzqYhYCXwnIm7MzFs72l2bmW+tvkRJUhV6Bn62usxPFU9XFo9l2Y12REeSyvU1hh8R+0TEdmAncFNmbunS7E8jYkdEfCki1pa8z8aI2BoRW3ft2jV41ZKkResr8DPzt5m5DjgUWB8RL+9o8nVgJjOPAm4CPl3yPpsyczYzZ6empoYouzsnbSWp3KLO0snMJ4BbgFM61j+WmU8XTz8FvLKS6gbgnK0kddfPWTpTEXFgsfx84GTgex1tVrc9PRO4t8Ia+xaO4ktSqX7O0lkNfDoi9qH1C+K6zLw+It4PbM3MzcBfRcSZwB7gceD1oyq4F++HL0nd9XOWzg7g6C7rL25bvgi4qNrSJElVqt2Vto7hS1J3tQp8SVK5WgW+p2VKUrlaBT4s00uAJWkZqFXge1qmJJWrVeCDd8uUpDK1C3xJUnf1CvxwDF+SytQr8CVJpWoV+E7ZSlK5WgU+4JiOJJWoVeCHV15JUqlaBT7YwZekMrULfElSd7UK/MALrySpTK0CX5JUrlaB75ytJJWrVeCDk7aSVKZWgW8HX5LK1Srwwa84lKQytQt8SVJ3tQr8iCAdxZekrmoV+JKkcrUKfCdtJalcz8CPiP0i4raIuCMi7o6I93Vps29EXBsRcxGxJSJmRlJtH5y0laTu+unhPw2clJmvANYBp0TEsR1t3gj8PDNfDHwEuLTSKvtlF1+SSq3o1SBbN6d5qni6snh09qPPAi4plr8E
XB4RkWO4sc3j//t//O0X7xj49RvWr+WVhx1cYUWStDz0DHyAiNgH2Aa8GPh4Zm7paLIGeAAgM/dExG7gEODRjvfZCGwEmJ6eHq7yLtbPHMytP3yM/557tHfjLh558tckGPiSaqmvwM/M3wLrIuJA4KsR8fLMvGuxO8vMTcAmgNnZ2cp7/xvWT7Nh/eC/SI7/4M0VViNJy8uiztLJzCeAW4BTOjY9BKwFiIgVwAuAxyqob8k56Suprvo5S2eq6NkTEc8HTga+19FsM3BesXw2cPM4xu8lSeX6GdJZDXy6GMf/PeC6zLw+It4PbM3MzcCVwGciYg54HNgwsopHzCt1JdVVP2fp7ACO7rL+4rblXwOvrbY0SVKVanWl7bD8AhVJdWbgd3JER1JNGfht7OFLqjMDv4MdfEl1ZeBLUkMY+G2CwMsHJNWVgS9JDWHgt3HSVlKdGfgdHNCRVFcGfhs7+JLqzMDv4JytpLoy8CWpIQz8NhHhGL6k2jLwJakhDPw2TtpKqjMDv4NX2kqqKwO/nV18STVm4Hewfy+prgx8SWoIA79NgF18SbVl4EtSQxj4bcLbZUqqMQO/QzqmI6mmDPw29u8l1VnPwI+ItRFxS0TcExF3R8SFXdqcGBG7I2J78bh4NOWOntddSaqrFX202QO8IzNvj4gDgG0RcVNm3tPR7tuZeUb1JUqSqtCzh5+ZD2fm7cXyL4B7gTWjLmwcIuzhS6qvRY3hR8QMcDSwpcvm4yLijoi4MSKOLHn9xojYGhFbd+3atfhqJUkD6zvwI2J/4MvA2zLzyY7NtwOHZeYrgI8BX+v2Hpm5KTNnM3N2ampqwJJHJ5y2lVRjfQV+RKykFfafy8yvdG7PzCcz86li+QZgZUSsqrTSJeJpmZLqqp+zdAK4Erg3Mz9c0uZFRTsiYn3xvo9VWehS8LorSXXWz1k6xwPnAndGxPZi3buBaYDMvAI4G3hzROwBfgVsyAm9sfxkVi1JvfUM/Mz8Dj2uScrMy4HLqypKklQ9r7TtYAdfUl0Z+JLUEAZ+G++WKanODPwOTtpKqisDv439e0l1ZuA/h118SfVk4EtSQxj4bbxbpqQ6M/AlqSEM/DYRjuBLqi8DX5IawsBv4/3wJdWZgd9hQm/yKUk9GfiS1BAGfhsnbSXVmYEvSQ1h4LcJvPBKUn0Z+JLUEAZ+O++HL6nGDPwOjuhIqisDX5IawsBv05q0tY8vqZ4MfElqCAO/jXO2kuqsZ+BHxNqIuCUi7omIuyPiwi5tIiI+GhFzEbEjIo4ZTbmSpEGt6KPNHuAdmXl7RBwAbIuImzLznrY2pwIvKR6vBj5R/DlR7OBLqrOegZ+ZDwMPF8u/iIh7gTVAe+CfBVyTrRnPWyPiwIhYXbx2otz50G7OvXLLuMt4jvOPn+Gkl75w3GVImmD99PCfEREzwNFAZyKuAR5oe/5gsW5e4EfERmAjwPT09CJLHb0zjvpDvr7jpzz19J5xlzLP3Q89ydQB+xr4kobSd+BHxP7Al4G3ZeaTg+wsMzcBmwBmZ2eX3fmPbzjhcN5wwuHjLuM5Trj05nGXIKkG+jpLJyJW0gr7z2XmV7o0eQhY2/b80GKdqrLsfj1KmjT9nKUTwJXAvZn54ZJmm4HXFWfrHAvsnsTxe0mqs36GdI4HzgXujIjtxbp3A9MAmXkFcANwGjAH/BI4v/JKG8wvZpFUhX7O0vkOPc5YLM7OuaCqoiRJ1fNK2wkQXiEgqQIG/oTwpm6ShmXgTwDv8SOpCgb+hLB/L2lYBr4kNYSBPwFaX8wy7iokTToDX5IawsCfABHhGL6koRn4ktQQBv4E8KxMSVUw8CeEF15JGpaBL0kNYeBPAu+WKakCBr4kNYSBPwEC7OJLGpqBL0kNYeBPgPB2mZIqYOBPiHRMR9KQDHxJaggDfwJ4t0xJVTDwJakhDPwJEGEPX9LwDHxJaggDfwKE98uUVIGegR8RV0XEzoi4q2T7iRGxOyK2F4+Lqy9T
npYpaVgr+mhzNXA5cM0Cbb6dmWdUUpEkaSR69vAz81vA40tQi0o4aSupClWN4R8XEXdExI0RcWRZo4jYGBFbI2Lrrl27Ktq1JKkfVQT+7cBhmfkK4GPA18oaZuamzJzNzNmpqakKdt0cdvAlDWvowM/MJzPzqWL5BmBlRKwaujJJUqWGDvyIeFEUt3OMiPXFez427PvqWd4tU1IVep6lExGfB04EVkXEg8B7gZUAmXkFcDbw5ojYA/wK2JB+43blPKKShtUz8DPznB7bL6d12qYkaRnzStsJ0BrQsYsvaTgGviQ1hIE/AbzwSlIVDHxJaggDfwJ4VqakKhj4E8IRHUnDMvAngPfDl1QFA39CeC2bpGEZ+JLUEAb+BIhwDF/S8Ax8SWoIA38COGUrqQoG/oRwzlbSsAz8SeCVV5IqYOBPCDv4koZl4EtSQxj4EyDwwitJwzPwJakhDPwJ4JytpCoY+JLUEAb+BLCDL6kKBv6EcM5W0rAMfElqCAN/AkQE6aVXkobUM/Aj4qqI2BkRd5Vsj4j4aETMRcSOiDim+jIlScPqp4d/NXDKAttPBV5SPDYCnxi+LLVz0lZSFVb0apCZ34qImQWanAVck61LQW+NiAMjYnVmPlxVkYLbf/IEJ3/4v8ZdhqQK/Nmr1vKmPz5iyffbM/D7sAZ4oO35g8W65wR+RGyk9b8ApqenK9h1M5x73GF84+5Hxl2GpIqs2n/fsey3isDvW2ZuAjYBzM7OOgvZp7PWreGsdWvGXYakCVfFWToPAWvbnh9arJMkLSNVBP5m4HXF2TrHArsdv5ek5afnkE5EfB44EVgVEQ8C7wVWAmTmFcANwGnAHPBL4PxRFStJGlw/Z+mc02N7AhdUVpEkaSS80laSGsLAl6SGMPAlqSEMfElqiBjXl2NHxC7gJwO+fBXwaIXlVMnaFm+51gXWNojlWhfUo7bDMnNqkB2MLfCHERFbM3N23HV0Y22Lt1zrAmsbxHKtC6zNIR1JaggDX5IaYlIDf9O4C1iAtS3ecq0LrG0Qy7UuaHhtEzmGL0lavEnt4UuSFsnAl6SmyMyJetD6ft37aN2d810j3M/9wJ3AdmBrse5g4CbgB8WfBxXrA/hoUdMO4Ji29zmvaP8D4Ly29a8s3n+ueG0sUMtVwE7grrZ1I6+lbB991HYJre9E2F48TmvbdlGxn/uAP+n1uQKHA1uK9dcCzyvW71s8nyu2z3TUtRa4BbgHuBu4cLkctwVqG+txA/YDbgPuKOp63xDvVUm9fdR2NfDjtmO2bkw/B/sA3wWuXy7HrGuWjCowR/EoDuoPgSOA5xUf/stGtK/7gVUd6z6094AD7wIuLZZPA24s/pEdC2xp+4fyo+LPg4rlvQFzW9E2iteeukAtrwGOYX6ojryWsn30UdslwN90afuy4jPbt/jH+sPiMy39XIHrgA3F8hXAm4vltwBXFMsbgGs79rWa4occOAD4frH/sR+3BWob63Er/h77F8sraYXJsYt9ryrr7aO2q4Gzuxyzpf45eDvwLzwb+GM/Zl2zZBRhOaoHcBzwjbbnFwEXjWhf9/PcwL8PWN32Q3tfsfxJ4JzOdsA5wCfb1n+yWLca+F7b+nntSuqZYX6ojryWsn30UdsldA+ueZ8X8I3iM+36uRY/eI8CKzo//72vLZZXFO0W+l/SvwInL6fj1qW2ZXPcgN8Hbgdevdj3qrLekuPVXtvVdA/8Jfs8aX3L338CJwHXD3L8R33M9j4mbQy/7AvTRyGBf4+IbcWXrwO8MJ/9Nq9HgBf2qGuh9Q92Wb8YS1FL2T768daI2BERV0XEQQPWdgjwRGbu6VLbM68ptu8u2j9HRMwAR9PqFS6r49ZRG4z5uEXEPhGxndYw3U20epeLfa8q620/VvNqy8y9x+wDxTH7SETs/Xbwpfw8LwPeCfyueD7I8R/JMes0aYG/lE7IzGOAU4ELIuI17Ruz9Ws1x1JZh6WoZZH7+ATwR8A64GHgH0ZU
Vk8RsT/wZeBtmflk+7ZxH7cutY39uGXmbzNzHa1e63rgpUtdQ5nO2iLi5bR6uy8FXkVrmObvRlzDvM8zIs4AdmbmtlHutyqTFvhL9oXpmflQ8edO4Ku0/vH/LCJWAxR/7uxR10LrD+2yfjGWopayfSwoM39W/HD+DvgnWsdukNoeAw6MiBUd6+e9V7H9BUX7Z0TESlqB+rnM/EqPv9OSHrdutS2X41bU8gStieXjBnivKut9jrbaTsnMh7PlaeCfGfyYDfp5Hg+cGRH3A1+gNazzjwv8fcZyzJ7Ra8xnOT1ojXn9iNakxt4JjCNHsJ8/AA5oW/4fWjPlf8/8yZsPFcunM3+C6LZi/cG0ziA4qHj8GDi42NY5QXRaj5pmmD9OPvJayvbRR22r25b/GvhCsXwk8yemfkRrUqr0cwW+yPyJqbcUyxcwf/Lruo6aArgGuKxj/diP2wK1jfW4AVPAgcXy84FvA2cs9r2qrLeP2la3HdPLgA+O8efgRJ6dtB37MetaY9VhOeoHrdn379MaW3zPiPZxRHFg954C9p5i/SG0Jmd+APxH2z+UAD5e1HQnMNv2Xm+gddrUHHB+2/pZ4K7iNZez8ITj52n9F/83tMbq3rgUtZTto4/aPlPsewewmflB9p5iP/fRdmZS2edafBa3FTV/Edi3WL9f8Xyu2H5ER10n0Pqv9w7aTnNcDsdtgdrGetyAo2idWrij+HtdPMR7VVJvH7XdXByzu4DP8uyZPEv6c1C0O5FnA3/sx6zbw1srSFJDTNoYviRpQAa+JDWEgS9JDWHgS1JDGPiS1BAGviQ1hIEvSQ3x/4tppPoWqYdUAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(union_item['count'].values[40000:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Roughly 75,000 pairs co-occur at least once."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Article information"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Number of clicks per article category\n",
+ "plt.plot(user_click_merge['category_id'].value_counts().values)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Rarely seen article categories; some appear only a handful of times\n",
+ "plt.plot(user_click_merge['category_id'].value_counts().values[150:])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "count 1.630633e+06\n",
+ "mean 2.043012e+02\n",
+ "std 6.382198e+01\n",
+ "min 0.000000e+00\n",
+ "25% 1.720000e+02\n",
+ "50% 1.970000e+02\n",
+ "75% 2.290000e+02\n",
+ "max 6.690000e+03\n",
+ "Name: words_count, dtype: float64"
+ ]
+ },
+ "execution_count": 46,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Descriptive statistics of article word counts\n",
+ "user_click_merge['words_count'].describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(user_click_merge['words_count'].values)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### User preference across article categories\n",
+ "\n",
+ "This feature can be used to measure how broad a user's interests are."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAUlUlEQVR4nO3dfZBc1Xnn8e8zM3pBaCwkNBJCAiQbsKwEy8CYwoEihTG2wXGwY5dDditWHGrZsp3EjpNdw9q1dtXGu3YqNvFWsomJIaESyoGAMSQFwRhjezeJJY+MAAsEEuJFEnoZAXpBGAlJZ//oK2UkzfRtzfR097nz/VRNze3Tt/s+Z27rp9unT98bKSUkSfnrancBkqTmMNAlqSIMdEmqCANdkirCQJekiuhp5cZmz56dFi5c2MpNSlL2Vq5cuT2l1Fe2XksDfeHChQwMDLRyk5KUvYh4rpH1HHKRpIow0CWpIgx0SaoIA12SKsJAl6SKMNAlqSIMdEmqiCwC/a6HN/J3P25oGqYkTVhZBPo9q17g9oEN7S5DkjpaFoEuSSpnoEtSRWQT6F4pT5LqyyLQI6LdJUhSx8si0CVJ5Qx0SaqIbAI94SC6JNWTRaA7gi5J5bIIdElSOQNdkioim0B3Hrok1ZdFoDsNXZLKZRHokqRy2QS6Qy6SVF8mge6YiySVySTQJUllDHRJqohsAt0hdEmqL4tAd9qiJJXLItAlSeUMdEmqiGwCPTkRXZLqyiLQHUKXpHJZBLokqZyBLkkVYaBLUkVkEejOQ5ekclkEuiSpXEOBHhG/HxGrI+JnEfGtiJgaEYsiYnlErIuI2yJi8ngXK0kaWWmgR8R84PeA/pTSLwLdwNXAV4AbUkpnAi8D14xnoU5Dl6T6Gh1y6QFOiIgeYBqwGXgncEdx/y3AB5peXSGciS5JpUoDPaW0CfgT4HlqQb4TWAnsSCntL1bbCMwf7vERcW1EDETEwODgYHOqliQdo5Ehl5nAVcAi4FTgROC9jW4gpXRjSqk/pdTf19c36kIlSfU1MuTyLuCZlNJgSul14NvARcBJxRAMwAJg0zjVCEDyjOiSVFcjgf48cGFETIuIAC4DHgceAj5crLMMuHt8SnQeuiQ1opEx9OXUPvz8KfBY8Zgbgc8Cn4mIdcDJwE3jWKckqURP+SqQUvoC8IWjmtcDFzS9IknSqGTzTVHnoUtSfVkEumPoklQui0CXJJUz0CWpIrIJdIfQJam+LALdc7lIUrksAl2SVC6bQE/OW5SkuvIIdEdcJKlUHoEuSSploEtSRWQT6I6gS1J9WQS6Q+iSVC6LQJcklTPQJaki8gl0B9Elqa4sAj08f64klcoi0CVJ5Qx0SaqIbALdIXRJqi+LQHcEXZLKZRHokqRyBrokVUQ2ge750CWpviwC3WnoklQui0CXJJUz0CWpIrIJdEfQJam+LALdIXRJKpdFoEuSyhnoklQR2QS609Alqb4sAt3zoUtSuYYCPSJOiog7ImJNRDwREe+IiFkR8UBErC1+zxzvYiVJI2v0CP3rwD+nlBYDS4EngOuAB1NKZwEPFrclSW1SGugRMQO4BLgJIKW0L6W0A7gKuKVY7RbgA+NTYk1yJrok1dXIEfoiYBD464h4OCK+GREnAnNTSpuLdbYAc4d7cERcGxEDETEwODg4qiIdQZekco0Eeg9wHvAXKaVzgT0cNbySaqdCHPYQOqV0Y0qpP6XU39fXN9Z6JUkjaCTQNwIbU0rLi9t3UAv4rRExD6D4vW18Sqxx2qIk1Vca6CmlLcCGiHhz0XQZ8DhwD7CsaFsG3D0uFYJjLpLUgJ4G1/td4NaImAysBz5G7T+D2yPiGuA54CPjU6IkqRENBXpKaRXQP8xdlzW1GknSqGXxTVFwDF2SymQR6OEguiSVyiLQJUnlDHRJqggDXZIqIotA9+y5klQui0CXJJUz0CWpIrIJ9OREdEmqK4tAdwhdksplEeiSpHIG
uiRVRDaB7gi6JNWXRaA7D12SymUR6JKkcga6JFVENoHuNHRJqi+LQPd86JJULotAlySVM9AlqSKyCfTkTHRJqiuLQHceuiSVyyLQJUnlsgl0py1KUn1ZBLpDLpJULotAlySVM9AlqSKyCXSH0CWpvkwC3UF0SSqTSaBLkspkE+hOW5Sk+rII9NcPHGT7K3vbXYYkdbQsAv3nrx+gd2pPu8uQpI7WcKBHRHdEPBwR/1TcXhQRyyNiXUTcFhGTx6vIOb1TnOYiSSWO5wj9U8ATQ25/BbghpXQm8DJwTTMLG6o7ggMOoktSXQ0FekQsAN4HfLO4HcA7gTuKVW4BPjAO9QHQ3RUcOGigS1I9jR6h/ynwX4GDxe2TgR0ppf3F7Y3A/OEeGBHXRsRARAwMDg6Orsiu4KBH6JJUV2mgR8SvANtSSitHs4GU0o0ppf6UUn9fX99onqI25OIRuiTV1cjUkYuAX42IK4GpwBuArwMnRURPcZS+ANg0XkV2BZjnklRf6RF6Sun6lNKClNJC4Grg+yml/wg8BHy4WG0ZcPe4FdlV++r/QVNdkkY0lnnonwU+ExHrqI2p39Scko7VXZwQ3ZkukjSy4/q2TkrpB8APiuX1wAXNL+lYh47QDxxMTOpuxRYlKT9ZfFN0589fB2Dv/oMla0rSxJVFoJ86YyqAM10kqY4sAr2nu1bm/oMeoUvSSPII9GIMff8Bj9AlaSR5BPqhI3QDXZJGlEegHzpCd8hFkkaURaDvO1AL8hf37GtzJZLUubII9FNnnAD4TVFJqieLQJ86qVbmoSN1SdKxsgj0ScWHovv8YpEkjSiLQO/prn0o+tyLr7a5EknqXFkE+uzpUwCYMimLciWpLbJIyCk9DrlIUpksAn2ygS5JpfII9OJD0Uc27mhvIZLUwbII9ENf/T8020WSdKxsEnLJvDfw1NZX2l2GJHWsbAJ9z779nDjZyxVJ0kiyCfSz5/by6Kad7S5DkjpWNoH+2usHiHYXIUkdLJtAP/f0mezdf9ATdEnSCLIJ9JRqQf7Czp+3uRJJ6kzZBPpb5r0BgMHde9tciSR1pmwCfcYJkwBYs2V3myuRpM6UTaAvPqUXgE0vO+QiScPJJtB7p9aO0Fc8+1KbK5GkzpRNoE/u6WLxKb28+Ipj6JI0nGwCHeCkaZN4enAPe/bub3cpktRxsgr0S87uA+ClPfvaXIkkdZ6sAv3MvukA3PaTDW2uRJI6T1aB/stvrh2hv+KQiyQdI6tAn9LTTV/vFP7mX5/l1X2GuiQNlVWgA1z0ppMBvzEqSUcrDfSIOC0iHoqIxyNidUR8qmifFREPRMTa4vfM8S8XrjhnHgB/8t2nWrE5ScpGI0fo+4E/SCktAS4EPhkRS4DrgAdTSmcBDxa3x92Fb6wdoT+7fU8rNidJ2SgN9JTS5pTST4vl3cATwHzgKuCWYrVbgA+MU41HmHHCJN6/9FQe27STex/b3IpNSlIWjmsMPSIWAucCy4G5KaVDiboFmDvCY66NiIGIGBgcHBxLrYf92rnzAfju6i1NeT5JqoKGAz0ipgN3Ap9OKe0ael+qnax82CtPpJRuTCn1p5T6+/r6xlTsIZcunsPiU3r5zqoXWPGM53aRJGgw0CNiErUwvzWl9O2ieWtEzCvunwdsG58Sh/f+pacCcNfDm1q5WUnqWI3McgngJuCJlNLXhtx1D7CsWF4G3N388kb2yUvPZOHJ01i+/kV+8GRL/y+RpI7UyBH6RcBvAu+MiFXFz5XAl4HLI2It8K7idktddOZsNrz8Kv/noadbvWlJ6jg9ZSuklP4fECPcfVlzyzk+X/rgOWzbvZdHNuzgtp88z0f6T6P2hkKSJp7svil6tCXz3sC23Xv57J2P8cLO19pdjiS1TfaB/vuXn82f/YdzAbhz5UbWbNlV8ghJqqbsAx3gjFknAvC1B57iD25/pM3VSFJ7VCLQz1kwg4HPv4srfvEUtu56jR89NcimHV5MWtLEUolA
B5g9fQoLZ5/I9lf28dGbV/Cfbhlod0mS1FKVCXSAT112Fnd+/Je4fMlctu56jSe37Gb94CvUvsgqSdVWqUCfOqmb88+Yydlzp/Pinn28509/xDu/+kP+8VFP4iWp+krnoefo2kvexDnzZ/Da6wf59G2reGZwDzte3UcQzJg2qd3lSdK4iFYOR/T396eBgdaNbaeUOPvz9/H6gX/v4/VXLOY///KbWlaDJI1VRKxMKfWXrVfJI/RDIoIbP9p/+GIYNzzwFOsHvTCGpGqqdKADXPrmOfDm2vKty5/nH1Zu4DuramdojIAvvv8XuPqC09tYoSQ1R+UDfajPXfkWfvzMi4dv/92/PccjG3dy9QVtLEqSmmRCBfqli+dw6eI5h28/sHord6/axL+s2364racr+J+/ds7ha5dKUi4mVKAf7ROXnnlEmKeU+M6qFxh49iUDXVJ2Kj3LZTTO/tx9zJ4+mQUzpx3R3tUF/+U9izn/jJltqkzSRNXoLJdKfbGoGX7rooWccfKJdHfFET8/Xv8SP/TKSJI62IQechnOf7vyLcO2n/OF+/mnxzazfvvw0x7n9E7l8+97C11dXmBDUnsY6A1631vnseLZl3h887HnW9/92n4Gd+/lYxct5LRZ04Z5tCSNPwO9QV/+0FtHvO/exzbziVt/yg3fe4qZ0ybXfZ63LpjBVW+b3+zyJMlAb4az5/Yye/oUvrt6a9319u4/QO/USQa6pHFhoDfBmXOmM/D5d5Wu9+X71vDN/7uev/rR+oafu6sr+NWlp9LXO2UsJUqaAAz0FjprznT2H0x86d4njutxr+7dz+9edtY4VSWpKgz0FvrQ+Qu44pxTOHgcU//f/kff4+ENO7i7OP/MaJw+axrnnu78eanqDPQWmzb5+P7k82eewPfXbOP7a0Y/B35KTxdr/sd7iXBKpVRlBnqHu+sTv8S23XtH/fjbBzbwjR+u50drtzOlp3nfI+vuCpYuOInJTXxOSWNjoHe43qmT6J06+qssLT6lF4BlN69oVkmH/fdfWcJvX7yo6c8raXQM9Ip7/1tP5bSZ09h34GBTn3fZzStYu+2VwxcPaYepk7o5ZcbUtm1f6jQGesX1dHfRv3BW05931omT+daK5/nWiueb/tzH486Pv4Pzz2h+/6QcGegalZuWvZ2123a3bfvbdu3lf923hme2v8qSeTPaVsfRpk7q8sNntY2nz1WWtu1+jQu+9GC7yzjGh85bwFc/srTdZahivEi0Km1O71Ru+PWlbN01+hlAzXbHyo08tbV971okA13Z+uC5C9pdwhEef2EX//joC5zzxfvbXcqE9JnLz+ZjF03sWVcGutQk11y8iJOn1z/bpsbHXQ9vYuC5lw30sTw4It4LfB3oBr6ZUvpyU6qSMrT0tJNYetpJ7S5jQlr53Mv8YM02Lv/aD9tdyohuWvZ2Tj95fK+XMOpAj4hu4M+By4GNwE8i4p6U0uPNKk6SGnHNxYu4f/WWdpdRVyu+VT2WI/QLgHUppfUAEfH3wFWAgS6ppa5623yvM8DYLhI9H9gw5PbGou0IEXFtRAxExMDg4OAYNidJqmfc3wOklG5MKfWnlPr7+vrGe3OSNGGNJdA3AacNub2gaJMktcFYAv0nwFkRsSgiJgNXA/c0pyxJ0vEa9YeiKaX9EfE7wP3Upi3enFJa3bTKJEnHZUzz0FNK9wL3NqkWSdIYeLkZSaoIA12SKqKlp8+NiEHguVE+fDawvYnl5MA+Twz2ufrG2t8zUkql875bGuhjEREDjZwPuErs88Rgn6uvVf11yEWSKsJAl6SKyCnQb2x3AW1gnycG+1x9LelvNmPokqT6cjpClyTVYaBLUkVkEegR8d6IeDIi1kXEde2u53hFxLMR8VhErIqIgaJtVkQ8EBFri98zi/aIiP9d9PXRiDhvyPMsK9ZfGxHLhrSfXzz/uuKx0YY+3hwR2yLiZ0Paxr2PI22jjX3+YkRsKvb1qoi4csh91xf1PxkR
7xnSPuzruzjx3fKi/bbiJHhExJTi9rri/oUt6u9pEfFQRDweEasj4lNFe2X3c50+d+Z+Til19A+1E389DbwRmAw8Aixpd13H2YdngdlHtf0xcF2xfB3wlWL5SuA+IIALgeVF+yxgffF7ZrE8s7hvRbFuFI+9og19vAQ4D/hZK/s40jba2OcvAn84zLpLitfuFGBR8Zrurvf6Bm4Hri6W/xL4eLH8CeAvi+Wrgdta1N95wHnFci/wVNGvyu7nOn3uyP3c0n/0o/yDvgO4f8jt64Hr213XcfbhWY4N9CeBeUNeNE8Wy98AfuPo9YDfAL4xpP0bRds8YM2Q9iPWa3E/F3JkuI17H0faRhv7PNI/9CNet9TOUvqOkV7fRaBtB3qK9sPrHXpssdxTrBdt2N93U7umcOX38zB97sj9nMOQS0OXuutwCfhuRKyMiGuLtrkppc3F8hZgbrE8Un/rtW8cpr0TtKKPI22jnX6nGGK4ecjQwPH2+WRgR0pp/1HtRzxXcf/OYv2WKd7+nwssZ4Ls56P6DB24n3MI9Cq4OKV0HnAF8MmIuGTonan2X3Cl54+2oo8d8nf8C+BNwNuAzcBX21rNOIiI6cCdwKdTSruG3lfV/TxMnztyP+cQ6Nlf6i6ltKn4vQ24C7gA2BoR8wCK39uK1Ufqb732BcO0d4JW9HGkbbRFSmlrSulASukg8FfU9jUcf59fBE6KiJ6j2o94ruL+GcX64y4iJlELtltTSt8umiu9n4frc6fu5xwCPetL3UXEiRHRe2gZeDfwM2p9OPTp/jJqY3MU7R8tZghcCOws3mreD7w7ImYWb+/eTW2sbTOwKyIuLGYEfHTIc7VbK/o40jba4lDoFD5IbV9Drc6ri5kLi4CzqH0AOOzruzgKfQj4cPH4o/9+h/r8YeD7xfrjqvjb3wQ8kVL62pC7KrufR+pzx+7ndnywMIoPIq6k9uny08Dn2l3Pcdb+RmqfaD8CrD5UP7WxsAeBtcD3gFlFewB/XvT1MaB/yHP9NrCu+PnYkPb+4gX1NPBntOcDsm9Re+v5OrVxwGta0ceRttHGPv9t0adHi3+Q84as/7mi/icZMhNppNd38dpZUfwt/gGYUrRPLW6vK+5/Y4v6ezG1oY5HgVXFz5VV3s91+tyR+9mv/ktSReQw5CJJaoCBLkkVYaBLUkUY6JJUEQa6JFWEgS5JFWGgS1JF/H85cMkmMcaqfgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
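+    "The plot above shows that a small fraction of users read across an extremely wide range of categories, while most users have read fewer than 20 news categories."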
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+       "<div>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>category_id</th></tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr><th>count</th><td>250000.000000</td><td>250000.000000</td></tr>\n",
+       "    <tr><th>mean</th><td>124999.500000</td><td>4.573188</td></tr>\n",
+       "    <tr><th>std</th><td>72168.927986</td><td>4.419800</td></tr>\n",
+       "    <tr><th>min</th><td>0.000000</td><td>1.000000</td></tr>\n",
+       "    <tr><th>25%</th><td>62499.750000</td><td>2.000000</td></tr>\n",
+       "    <tr><th>50%</th><td>124999.500000</td><td>3.000000</td></tr>\n",
+       "    <tr><th>75%</th><td>187499.250000</td><td>6.000000</td></tr>\n",
+       "    <tr><th>max</th><td>249999.000000</td><td>95.000000</td></tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+ ],
+ "text/plain": [
+       "             user_id    category_id\n",
+       "count  250000.000000  250000.000000\n",
+       "mean   124999.500000       4.573188\n",
+       "std     72168.927986       4.419800\n",
+       "min         0.000000       1.000000\n",
+       "25%     62499.750000       2.000000\n",
+       "50%    124999.500000       3.000000\n",
+       "75%    187499.250000       6.000000\n",
+       "max    249999.000000      95.000000"
+ ]
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
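+    "### Distribution of the length of articles users read\n",
+    "\n",
+    "Computing the average word count of the news each user clicked indicates whether a user prefers long or short articles."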
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
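+    "The plot above shows that a small fraction of users read articles with a very high average word count, and a small fraction read articles with a very low average word count.\n",
+    "\n",
+    "Most users prefer news of roughly 200-400 words."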
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+    "# Zoom in on the range that covers most users\n",
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Most users read articles of fewer than 250 words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+       "<div>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>words_count</th></tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr><th>count</th><td>250000.000000</td><td>250000.000000</td></tr>\n",
+       "    <tr><th>mean</th><td>124999.500000</td><td>205.830189</td></tr>\n",
+       "    <tr><th>std</th><td>72168.927986</td><td>47.174030</td></tr>\n",
+       "    <tr><th>min</th><td>0.000000</td><td>8.000000</td></tr>\n",
+       "    <tr><th>25%</th><td>62499.750000</td><td>187.500000</td></tr>\n",
+       "    <tr><th>50%</th><td>124999.500000</td><td>202.000000</td></tr>\n",
+       "    <tr><th>75%</th><td>187499.250000</td><td>217.750000</td></tr>\n",
+       "    <tr><th>max</th><td>249999.000000</td><td>3434.500000</td></tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+ ],
+ "text/plain": [
+       "             user_id    words_count\n",
+       "count  250000.000000  250000.000000\n",
+       "mean   124999.500000     205.830189\n",
+       "std     72168.927986      47.174030\n",
+       "min         0.000000       8.000000\n",
+       "25%     62499.750000     187.500000\n",
+       "50%    124999.500000     202.000000\n",
+       "75%    187499.250000     217.750000\n",
+       "max    249999.000000    3434.500000"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+    "# More detailed summary statistics\n",
+ "user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Analysis of user click times"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "# Normalize the timestamps to [0, 1] so that they are easier to visualize\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "mm = MinMaxScaler()\n",
+ "user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])\n",
+ "user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])\n",
+ "\n",
+ "user_click_merge = user_click_merge.sort_values('click_timestamp')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+       "<div>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th><th>rank</th><th>click_cnts</th><th>category_id</th><th>created_at_ts</th><th>words_count</th></tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr><th>18</th><td>249990</td><td>162300</td><td>0.000000</td><td>4</td><td>3</td><td>20</td><td>1</td><td>25</td><td>2</td><td>5</td><td>5</td><td>281</td><td>0.989186</td><td>193</td></tr>\n",
+       "    <tr><th>2</th><td>249998</td><td>160974</td><td>0.000002</td><td>4</td><td>1</td><td>12</td><td>1</td><td>13</td><td>2</td><td>5</td><td>5</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
+       "    <tr><th>30</th><td>249985</td><td>160974</td><td>0.000003</td><td>4</td><td>1</td><td>17</td><td>1</td><td>8</td><td>2</td><td>8</td><td>8</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
+       "    <tr><th>50</th><td>249979</td><td>162300</td><td>0.000004</td><td>4</td><td>1</td><td>17</td><td>1</td><td>25</td><td>2</td><td>2</td><td>2</td><td>281</td><td>0.989186</td><td>193</td></tr>\n",
+       "    <tr><th>25</th><td>249988</td><td>160974</td><td>0.000004</td><td>4</td><td>1</td><td>17</td><td>1</td><td>21</td><td>2</td><td>17</td><td>17</td><td>281</td><td>0.989092</td><td>259</td></tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+ ],
+ "text/plain": [
+       "    user_id  click_article_id  click_timestamp  click_environment  \\\n",
+       "18   249990            162300         0.000000                  4   \n",
+       "2    249998            160974         0.000002                  4   \n",
+       "30   249985            160974         0.000003                  4   \n",
+       "50   249979            162300         0.000004                  4   \n",
+       "25   249988            160974         0.000004                  4   \n",
+       "\n",
+       "    click_deviceGroup  click_os  click_country  click_region  \\\n",
+       "18                  3        20              1            25   \n",
+       "2                   1        12              1            13   \n",
+       "30                  1        17              1             8   \n",
+       "50                  1        17              1            25   \n",
+       "25                  1        17              1            21   \n",
+       "\n",
+       "    click_referrer_type  rank  click_cnts  category_id  created_at_ts  \\\n",
+       "18                    2     5           5          281       0.989186   \n",
+       "2                     2     5           5          281       0.989092   \n",
+       "30                    2     8           8          281       0.989092   \n",
+       "50                    2     2           2          281       0.989186   \n",
+       "25                    2    17          17          281       0.989092   \n",
+       "\n",
+       "    words_count  \n",
+       "18          193  \n",
+       "2           259  \n",
+       "30          259  \n",
+       "50          193  \n",
+       "25          259  "
+ ]
+ },
+ "execution_count": 54,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_merge.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "metadata": {},
+ "outputs": [],
+ "source": [
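+    "def mean_diff_time_func(df, col):\n",
+    "    # Mean absolute gap between consecutive values of `col` for one user.\n",
+    "    # Pass the column name as a list: a set is unordered and newer pandas rejects it.\n",
+    "    df = pd.DataFrame(df, columns=[col])\n",
+    "    # The first row has no predecessor, so it is compared against 0.\n",
+    "    df['time_shift1'] = df[col].shift(1).fillna(0)\n",
+    "    df['diff_time'] = abs(df[col] - df['time_shift1'])\n",
+    "    return df['diff_time'].mean()"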
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "# Mean time gap between a user's consecutive clicks\n",
+    "mean_diff_click_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 57,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(mean_diff_click_time.values, reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "The plot above shows that the time gap between clicks differs from user to user."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "# Mean gap between the creation times of consecutively clicked articles\n",
+    "mean_diff_created_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(mean_diff_created_time.values, reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "The plot shows that the creation times of the articles a user clicks one after another also differ."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Defaulting to user installation because normal site-packages is not writeable\n",
+ "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
+ "Collecting gensim\n",
+ " Downloading https://mirrors.aliyun.com/pypi/packages/2b/e0/fa6326251692056dc880a64eb22117e03269906ba55a6864864d24ec8b4e/gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)\n",
+ "\u001b[K |████████████████████████████████| 24.2 MB 91.0 MB/s eta 0:00:01\n",
+ "\u001b[?25hRequirement already satisfied: six>=1.5.0 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.15.0)\n",
+ "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
+ "Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.5.4)\n",
+ "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
+ "Collecting smart-open>=1.8.1\n",
+ " Downloading https://mirrors.aliyun.com/pypi/packages/e3/cf/6311dfb0aff3e295d63930dea72e3029800242cdfe0790478e33eccee2ab/smart_open-4.0.1.tar.gz (117 kB)\n",
+ "\u001b[K |████████████████████████████████| 117 kB 96.7 MB/s eta 0:00:01\n",
+ "\u001b[?25hBuilding wheels for collected packages: smart-open\n",
+ " Building wheel for smart-open (setup.py) ... \u001b[?25ldone\n",
+ "\u001b[?25h Created wheel for smart-open: filename=smart_open-4.0.1-py3-none-any.whl size=108249 sha256=50eb67320a58790e8b173971aeb6af7b636d48259d7c9de759612e58e334215b\n",
+ " Stored in directory: /home/admin/.cache/pip/wheels/c3/14/fc/a0e523e5d2f13d083ce0af09d4e2861d8e2ec65fc466fb1dff\n",
+ "Successfully built smart-open\n",
+ "Installing collected packages: smart-open, gensim\n",
+ "Successfully installed gensim-3.8.3 smart-open-4.0.1\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Install gensim\n",
+ "!pip install gensim"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "from gensim.models import Word2Vec\n",
+    "import logging, pickle\n",
+    "\n",
+    "# Note: the model is trained for only a few iterations (iter=10 below)\n",
+    "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
+    "    click_df = click_df.sort_values('click_timestamp')\n",
+    "    # Word2Vec expects string tokens, so cast the article ids to str\n",
+    "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
+    "    # Turn each user's click sequence into one sentence\n",
+    "    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
+    "    docs = docs['click_article_id'].values.tolist()\n",
+    "\n",
+    "    # Set up logging so that training progress is visible\n",
+    "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
+    "\n",
+    "    # These parameters strongly affect the learned vectors; negative sampling defaults to 5.\n",
+    "    # size/iter are the gensim 3.x argument names (gensim 3.8.3 is installed above).\n",
+    "    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=10)\n",
+    "    \n",
+    "    # Collect the embeddings into a dict keyed by article id (as str)\n",
+    "    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id'].unique()}\n",
+    "    \n",
+    "    return item_w2v_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "item_w2v_emb_dict = trian_item_word2vec(user_click_merge)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+       "<div>\n",
+       "<table border=\"1\" class=\"dataframe\">\n",
+       "  <thead>\n",
+       "    <tr style=\"text-align: right;\"><th></th><th>user_id</th><th>click_article_id</th><th>click_timestamp</th><th>click_environment</th><th>click_deviceGroup</th><th>click_os</th><th>click_country</th><th>click_region</th><th>click_referrer_type</th></tr>\n",
+       "  </thead>\n",
+       "  <tbody>\n",
+       "    <tr><th>25667</th><td>190841</td><td>199197</td><td>1507045276129</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
+       "    <tr><th>25668</th><td>190841</td><td>285298</td><td>1507045302920</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
+       "    <tr><th>25669</th><td>190841</td><td>156624</td><td>1507046638885</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
+       "    <tr><th>25670</th><td>190841</td><td>129029</td><td>1507046668885</td><td>4</td><td>1</td><td>17</td><td>1</td><td>20</td><td>2</td></tr>\n",
+       "    <tr><th>107739</th><td>164226</td><td>214800</td><td>1507131402464</td><td>4</td><td>1</td><td>17</td><td>1</td><td>21</td><td>2</td></tr>\n",
+       "  </tbody>\n",
+       "</table>\n",
+       "</div>"
+ ],
+ "text/plain": [
+       "        user_id  ...  click_referrer_type\n",
+       "25667    190841  ...                    2\n",
+       "25668    190841  ...                    2\n",
+       "25669    190841  ...                    2\n",
+       "25670    190841  ...                    2\n",
+       "107739   164226  ...                    2\n",
+ "\n",
+ "[5 rows x 9 columns]"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+    "# Randomly pick 15 users and look at how similar their consecutively viewed articles are\n",
+ "sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)\n",
+ "sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]\n",
+ "\n",
+ "sub_user_info.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [],
+ "source": [
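+    "# In the previous version this function used the embeddings provided with the competition, but those do not cover every article, so the plotting code below raised a KeyError\n",
+    "# To avoid that error, the function now uses the vectors trained with word2vec for the visualization\n",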
+ "def get_item_sim_list(df):\n",
+ " sim_list = []\n",
+ " item_list = df['click_article_id'].values\n",
+ " for i in range(0, len(item_list)-1):\n",
+    "        emb1 = item_w2v_emb_dict[str(item_list[i])] # Note: word2vec was trained on article ids cast to str\n",
+ " emb2 = item_w2v_emb_dict[str(item_list[i+1])]\n",
+ " sim_list.append(np.dot(emb1,emb2)/(np.linalg.norm(emb1)*(np.linalg.norm(emb2))))\n",
+ " sim_list.append(0)\n",
+ " return sim_list"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id ... click_referrer_type\n",
- "25667 190841 ... 2\n",
- "25668 190841 ... 2\n",
- "25669 190841 ... 2\n",
- "25670 190841 ... 2\n",
- "107739 164226 ... 2\n",
- "\n",
- "[5 rows x 9 columns]"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
+ "source": [
+ "for _, user_df in sub_user_info.groupby('user_id'):\n",
+ " item_sim_list = get_item_sim_list(user_df)\n",
+ " plt.plot(item_sim_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
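+    "Because the word vectors were trained for only a few iterations, the visualization is not very accurate; train for more iterations to observe the pattern more clearly."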
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
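+    "## Summary\n",
+    "\n",
+    "The data analysis so far yields several important findings that will help with the feature engineering and analysis that follow:\n",
+    "1. The user ids in the training and test sets do not overlap, so the model has never seen the test-set users.\n",
+    "2. The minimum number of clicked articles per user is 2 in the training set and 1 in the test set.\n",
+    "3. Users do click the same article more than once, but this only happens in the training set.\n",
+    "4. A single user's click environment is not always unique, so statistical features can be used when building features for it.\n",
+    "5. The number of articles a user clicks varies widely and can be used to build a user-activity feature.\n",
+    "6. The number of times an article is clicked also varies widely and can be used to build an article-popularity feature.\n",
+    "7. The news a user reads is strongly correlated, so whether a user is interested in an article depends to a large extent on the articles they clicked before.\n",
+    "8. The word count of clicked articles differs considerably between users, which reflects their preference for article length.\n",
+    "9. The topics of clicked articles also differ considerably between users, which reflects their topic preferences.\n",
+    "10. The time gap between clicks differs between users, which reflects their preference for article timeliness.\n",
+    "\n",
+    "The analysis above helps us do the later feature engineering well and fully exploit the information hidden in the data."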
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
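+    "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its members share a spirit of openness and exploration. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, the courage to experiment, and the willingness to take responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, empowering talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topical material for this data-mining learning path will be shared on Tianchi; follow Datawhale for details:\n",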
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- ],
- "source": [
- "# 随机选择5个用户,查看这些用户前后查看文章的相似性\n",
- "sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)\n",
- "sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]\n",
- "\n",
- "sub_user_info.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [],
- "source": [
- "# 上一个版本,这个函数使用的是赛题提供的词向量,但是由于给出的embedding并不是所有的数据的embedding,所以运行下面画图函数的时候会报keyerror的错误\n",
- "# 为了防止出现这个错误,这里修改为使用word2vec训练得到的词向量进行可视化\n",
- "def get_item_sim_list(df):\n",
- " sim_list = []\n",
- " item_list = df['click_article_id'].values\n",
- " for i in range(0, len(item_list)-1):\n",
- " emb1 = item_w2v_emb_dict[str(item_list[i])] # 需要注意的是word2vec训练时候使用的是str类型的数据\n",
- " emb2 = item_w2v_emb_dict[str(item_list[i+1])]\n",
- " sim_list.append(np.dot(emb1,emb2)/(np.linalg.norm(emb1)*(np.linalg.norm(emb2))))\n",
- " sim_list.append(0)\n",
- " return sim_list"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAACJRklEQVR4nOydd3hb5d3+P8/RXtawPOVty1kkhCwSNoSEpMwWymrpotCWskrnr5PSvm/X25ZCoZQOoItVyigbwiaMBBIy7STeey/Z2jq/P47lxImT2ImOEiv6XJcv29LROUeyfOs53+d+7q+QZZk0adKkSTP9kY70CaRJkyZNmsSQFvQ0adKkSRHSgp4mTZo0KUJa0NOkSZMmRUgLepo0adKkCNojdWC32y2XlJQcqcOnSZMmzbTkgw8+6JZlOWui+46YoJeUlLB+/fojdfg0adKkmZYIIRr2d1+65JImTZo0KUJa0NOkSZMmRUgLepo0adKkCGlBT5MmTZoUIS3oadKkSZMiHFTQhRB/FUJ0CiG27Od+IYS4QwixSwixSQixIPGnmSZNmjRpDsZkRuj3A6sOcP9qwDv6dS3wh8M/rTRp0qRJM1UO6kOXZfkNIUTJATa5EPibrOTwviuEcAgh8mRZbkvUSe7JyIcfMvzOO2gzM9G4XOO+SzYbQojDPsZgt5+aDV1IksBk02Gy6Ue/dJisOiRNulKVJk2ao49ELCzyAE17/N48ets+gi6EuBZlFE9RUdEhHcy/YQPdd/5+4jt1OrQuF5pMF1pXJtpMF5pMt/I9/vvYdxeSwTD20Eg4Su3GLra/3UZzVd8Bz8Fo0Y0TerNNhyljD9G36TGP/qw3aff7ITMwsJHevrcRSAihQQgNiNGf0SBGf0ZIo78rtzH6fZ/b9tjP3rcp+xCAACFGf5ZGz035EkLa4+fJ3CaNPTchtGg0psn/IY9WIiGofgaiEdCZRr/Mu7/rzbt/1xohAQOINGkSRVJXisqyfC9wL8CiRYsOqbNG5tVX4/rMZ4j09RHt7SXS07P7e08vkd7R7z09hGprifT0IAeDE+5LsloZzptJq3sJrQYvYfSYdSHmFgWoqNRjKSsi4vYQDAr8QyH8QyFGhsJjP/uHwvS2+mgeChEcjkx8DK0YFXf9PqP9Id3Xicj1h/IyTJpYVIuvdR6SJoykH0GjH0ajG0HSjyBpJj7nQ6Ws9GuUll6f0H0mFVmGZ74GG/4xyQeIPUTfMsEHgAn0ln1v01lo7XGQMe8UrIc4sEmTZiISIegtQOEevxeM3qYaQqdDl52NLjv7oNvKsow8MkKkt5doTw+R3l6G23upq4lQ02lhIGxBkqPkjOwkr+Nd7E0fImJRBoABAElCX1yMweslo7ISg9eL4UQv+uIKhEYzdpxoNEbAF2ZkcLfY7/MhMBiir22EkaEQkqEdw9kNdOxYgbX6IkwZWlx5Rlx5Zhx5Bhw5RhzZRnQGgSxHR79iyERBju2+jRjE75OjRCMhOup30bztI1p3bCfoW0g0ePaEr41GK6MzyeiNynedMYbOKKM3xUZ/jqE1RNGbZHSGKFpTFL0xis4QRdLFFAFERiZGf9/71NbdjsOxGKfzxIT8nZPOe39UxPzkm+GEqyA8DGE/hEdGv+/xc+gA94VHINAPQ23j7w8NgxylIbiAZ/q+S8WHr7Lyh5890s86TQqRCEF/CrheCPEQcCIwoFb9PE406p/05b0QAmGxoDOZ6RiysL1WS+1GiWgkRlaRjdNOysO7OAejZQVwHXIsRnRggEhnF6G6WoI7dhDcuZNgdTVDL700KmIg9Hr0FeUYvV4McaH3ejEX5B60ji/LMjW1f+IL7xgYtL/LDavOpKhvHr0tPra9OUgkFBvbNsNtxJVvxZVvIdNjITPfiiPHjEa7u44fi0Vp2b6VqrVvsOO9tQSGBtGbzFQsXk1b/fGYM7UMdT6Dr3+YisWnUzxvCaFAjOBIhOBIeNz34W7l
51AgesDnoNFKGMza0S8dBstC9J4QW7fdwolLnkWns0/q73PUUPMqvPBdmHEuLP8RSOrMk/Q29fPirzciE6Oly4ksywmZ90mTBkAcrKeoEOJB4AzADXQAPwJ0ALIs3yOUd+PvUZwwI8DnZVk+aOrWokWL5EMJ52ps/Cu1dXdw6invotEYD7r9YI+fqnfaqVrbxlBvAINZS+WJucw6KY+sQtuUjh3z+wnW1CoCHxf6nTuJdHSMbSPZbGPibvB6MVQq37VO57h9/eW187i9oYEsUxZ9wT7uOPMOTi04FTkmM9jjp6dlmN7WYXpaffS2DtPfPkIspvytJElgzzFhsvQQ9G2np2kjAV8/WoOB8oUnMvOk0yg5fgGtO4f4750fsfLqORQfZ+OV++9l62svk1tRycdu+AbO3Pz9P9dojJA/SmAvwR/7eXj3bYGRMH0NnYRjWgrP+gkFFfM4bs4d00eoemrgT2eBLQ+++BIxnR4h9Ak//4AvzKO/WE84GGV25od8UHccn/rxUhw55oQeJ01qI4T4QJblRRPed6SaRB+qoP+u+l1+2yKzdu4I+VnLJ9wmEo5S91E3299upWl0grNwppNZJ+dTerwbrU4z4eMOlejAwJi4B3fuJLBjB8Gdu4gNDIxto8lyK6N5byWaGR6uC/+M9qiVJz7xPF9+6cvU9Nfw++W/Z1n+somPEYnR1z5M3cbt1Kx/m47aD4gE+wENkq4UjX4Geks5mflOXB4rmfkWajZ00dc+zGf/9yR0BuVirPqdt3jpT3cSi8Y46/NfYs7pyw9fuGpeYej+a/h3z6/AbCD/rO8x94Tvk593yeHtNxkEBuEvK8DXAde8Qkd4G9u2fxNZltHrM0e/3Oh1buX72Ffm2M86nXN0onj/RKMx/nvHRtpqBvj4LQswbrqXfz6/kDOu8DLn9MIDPvZwkKPRcaXBNNOflBL0f9ft4hdVO/nfgvWsmPuDcfd1Nw+x7e02drzXTnAkgs1lZOZJecxclktGZnIdGLIsE+ns2mc0H9y1i7rFw3z7FD0XbY5xXcblGL51A1e/eDVNg0384ew/sCh3/N+qp7mJqrVvUL32DfraWpA0GornncCMZadSNHcRIwNCGcm37B7RjwyGAIhKfvqzNnLDjTfgzMwAYLC7i+fu+jXN27ZQuexUVnzxqxit1kN+rsM/XUn3mkZCFWW8or0Bs6sHz1m/Yemy/2A2lxzyflUnFoOHroSdL8JV/2EoK5v1H3wSq3UGTucyQqHuvb56kOXwPrsRQoNO55pA7LPGbvvwv1p2vDPCWZ+dwaxlHuQP/sb9fzHjmVvEyq8sSfhTk8NhWr/9bQJbt1H29H8ROl3Cj5HmyJBSgt744v/iWfsr6rJyKDv3z4TcS9i5vpPta9voahxCo5Uom+9m1sn5FMxwIqSj67Jfjkb57nOrebanjT+/XIGzzUfFyy/R4+/h8y98no7hDv644o+UyLlUj4p4V2M9CEHRnLnMOOk0vEtOwmTLOOBx3nmihg+fb0CUtNAZqOH8FZew8OTjxu6PxaKse/Ix1j76TywOF6uvv4XC2XOn9FxCzS10/uT7DL3+LkEdGMLgX3g879i+iNOzkYpz3mbRokeQpKNUTNb8BN78P1j9S0InfJJ16y9ClqMsXvwkBr17n81lWSYSGZxA6Ee/wj3jfo/FFHdV384z6NjwKVwznif7+MfR6ZxkDWlpevEcWrWn8tlfnZnQ8o4cDtNyyy0MvfQyAIV//jPWU05O2P7THFkOJOhHrMHFoSLk43j9o3ksrKhGuv9cBiMVtPnOB/dKTr2sksolORgtR6mAAMGYn9f62ljkzKV44en0/OnPyOEwmaZMbl/8S6555Ut88enPseLdLNwDBvIrZ3Hm575E5dKTsTpdkzpGLCaz8/0O8isdbPe/C0BnW9e4bSRJw4kfv5TiufN55s5f8cht3+XEiz7JskuuRKM98NsiNjxM95/+RO9f7yMqh3nsVMGzS/VcsSbC
yg8+Ys7xj7OVi2l4q5XMzN9RXv6NQ3ux1GTLY4qYn3AVsUVfYMumzxMKdbFwwcMTijkoE+w6nR2dzo7FUn7A3cuyTDTqo35LE9UftZM3I8bii+cRjuQTCnXTP/IE+fot7Bw8lYFOf8Lq6HIoRMvXv87QSy+T/Y2v0/2Hexh87tm0oB8jTDtB3/Hkyzhr+miuctLkzadwdoiVjt+C6SHQXgPi88DkhO9I8OT2e/HF4JPei9CHcwkK+OCxh9m1fTMtVVs53WjixVNGePWUAX5/8u0sKj9pysdo2tbLUG+AspMsfLReKb309Ey8WCq3opKrfnEHr95/L+89/ggNmzbwsRu/OeGEqRyLMfj003T+36+JdHbSsayMHy1ooCwzk0fP+wefsV4F2TIrnn+FoflZNO44j81r7sPleu/osjK2boQnvgqFJ8K5v2ZX7S/p63uH2bN+RUbGvIQcQgiBr0fDa3/rxplr4byvLERv3P3vtnGkiXzD+wC07OhLiKDvKeY53/0urs9cRWDHDoZeXoP8ox8h9PrDPkaao5tpt4b9pVMNXP8VmdfPOAGp3k/rUzFaOy4kpCmDNbfBb2bDf2+Grh1H+lQn5NFdT5GtlVnu/RybmutYM7uY1x9/iOCwj5Mvu4obfvFXHrr8cWxmO7es+w47+3ZO+Rhb32zBZNPRF2tCp9OhkQ0MDPbvd3u90cQ5X76J87/2Hfrb2/j7t25ky6svsWc5zr9pEw1XXEnrt76NJiuL5759Kjec0cgp0gj3nvMXCjMK+dFJt/LnE3rZ9uXllG15jMzBbbR+eBXvv3wH4fDAfo+fVHyd8NCnwOyCy/5BW9ezNDX9lcKCz5GX94mEHSY4EuaZuzchhODc6+aNE3MAo6UEo7kTs95Py47+wz6eHArRPFpmyfne93B95ioAMlavJjYwwPA77xz2MdIc/Uw7Qb/5vFsYMmv405lWOm4LYb18NYNrN1Hzp0ZaBz5NKP9c2PgvuGsx/OMSqHllzDt+pKnqqaJ6qIcV2SUIWc9HH75Lps/Px5efz2f/7y6WfuIynLn5FNgK+MvKv6CVtFzz4jXUDdRN+hi+viD1m3uYuSyXHTurqaiowKS1MewfOuhjK5eewlW/vJPcci8v3PM7nr79F/jq6mj99neov/QyQq0t2G/7Pj/+gpn7pHe4YdDPT7NOQ5dZAcAZhWdwftn5/NT1FuIX32FO9d8x+ztpeOVyNqz9H47UfM0YkRA8fBWM9MDl/2Iw1k5V9XdxOpZSUfGdhB0mFo3xwp+3MtjtZ/WXjyPDve+EvNlURMAkyLfU0rqz/7BeGzkUovlrt+B7eY0i5ld9euw+y8knI9lsDD73/CHvP830YdoJeq4tD5d2IWH/OvotGkKfyqX8pRdxfupKBte8Rc1v3qdt+LOE5t4IbR/B3z8Ody+DDx5QVusdQf659V50QuYi78U0bfmIUCBAaZ8P65Bvn22LMor48zl/Rkbmiy98kabBpgn2uC/b17Yix2TcXi1DQ0PMnDkTqzmDQHR4Uo/PcGdxyQ9+yqmf/BSx/z5D/XnnMfDMM2Recw26h//INdp/saV3K7/KO5tre7oQJ9847vHfXvJtXEYX348+Rsl993JCzf0QirHt4fnU7/r3pM5BFWQZnv06NL0LF91F0F3Aps1fQa9zc9xxdyR04vbtx3bRtK2X06+cQb7XOeE2JlMhfqNEvuYDhvuDDHQd2ntzTMzXrCHn+98fJ+YAkl6P7eyzGVqzhlgodEjHSDN9mHaCvu2NjeSFlwJRXh/Op7PrJXTZ2eR+97uUv/QSzssvZ+CZF6j50RO0DV5G+JSfg6SF/94Iv50Dr/wUhtqTft6+kI/nG17lBFOU/L4KOn/7W07b0Uz2kJ/Bl15mZN065Mj4bJUyexl/XvlnQrEQV794Na2+1gMeIxaT2fZWKwUznTR11iKEwOv14rA7iIkQI76Di4Ysy/heXoPrnr9S2drNQKaT17we/pszzKdfvwZf
2MdfVvyRVVtfguJTwDM+/t5usHPrSbeys28n90ffYNaD97Gg+e/4wy7eub2Hwf6aqb94ieD9P8GHf4NTbiE2+wK2bLmBcLifefPuQa/PTNhhtr7ZwqZXmjn+rEJmn7z/hVtGUxF+o4YClDp66yGUXfYR809/asLtMlavIjY0xPBbb0/5GGmmF9NO0AM9wxiGYgTNi3lzYIBe3y6Gh2sB0OVkk/v97ykj9ks/ycATT7Hra3+grWMF4Y/9XZkEe+P/4LfHweNfhrZNSTlnORrllf/+nsteDHLjbzW0XHEt1vc/RONyobHZCDc00HDVZ9h58im0fvvbDD7/AlGfMqL2Or3cu+JefGEfX3jhC7QP7//DqHFrD76+IHNO9VBVVUVxcTFms5lMtzJJ3Fzfsd/HAgSqq2n83OdpufEmJJOJovv+yuIXX6L3YxXcEX4E3VCUuxf9mvld9TDQBCfdMOF+Tis4jQvKL+Avm//CTtMg8x+8l3nd/6YvWsma7z1GNJrkkWLt6/D8d6ByFZz1A3bu+h/6+99n1syfYbPNSdhhWnb08caDOyia7eKkiw/sgjEZCwgYNTg0LZisEi07DpzwuTdyKETzzV9TxPwH+xdzAMvSpUh2O4PPPzelY6SZfkw7QZ81exYLh+34M84jRIS1Pi1dXS+O20aXm0vuD39I+Ysv4Lj4E/Q/9hg1V3+f9p2zCV/+Aiz6Amx7Cv54Ktx3LlQ9A7EDZ5dMFTkUwvfmW7T98EfsPO10vN97gJUfyhi9hRiuv441c0qw/s9t2Fadg7DZ8Pzud1jPOB3fa6/TcvPN7Fy2jMZrrqXvwQepiLj449l/pD/Yzxdf/CJdI10THnPrm62YMvTYCyW6urqYOXMmADn5WQC0t3RO+LhIXx9tt95K3cc/QbCqipwf/oDSx/+DaemJ3LPtTzykf5051plcsK6QNT/+Gf4XforsrgTvyv0+/28v+TaZxky+//b3idmtLPvH75gx+Byt0UWsuf6nyaun99bBo5+FzAr4xJ9obX+M5ua/U1T0RXJzL0jYYQa7/Tz/xy1kZJlY+cU5B83M12qthK0OhABPfpiWHZOvo4+J+SuvKGL+qf2LOSi5Q7YVZ+Nb8wqx/SSPpkkNpp2ga11GzvblE9GXYooV85rPSGvnxBM+urw88m69lYrnn8N+0UX0PfIINZdeS/sGB+Gr3oCVP4X+BmW14J0L4d17IHjwycP9EfP7GXzpJVq+9S12nHwKTddcw+DTTxOcV8FvL5J47kdR8n//G+qtemSTiZL5C9EXFCIPDmJZeiL5v/gF3rffovjvf8P5qU8Ramig/ce3sev0M7B++cfc27wCQ20b17zwRXoDveOOPdQboGFzN7NOymPHzmoAZsyYAYCnOAeArs6ecY+Rw2F6//Y3as5ZRf+j/8Z55ZWUv/A8riuvJCSifPuNb/OnzX/iE95P8MDF/+Kan/+BueUWTL46PvKV4R/Zf10+Q5/Bj076Ebv6d3HPR/cgWSycdd9PKAi8y075DN655hvI4X1XXSaU4JDyt5VluOJBBoK1VFX/EJfzFMrLvpmww4T8EZ65exOyLHPudfMwmCdZj3eWAuDJ7GG4P8hg9yRKYqEQzTfdPGkxj5OxajWx4WGG33xzcueWZloy/QTdaSQjKrDHQK85jcFojFfatxEI7r8UofN4yPvJbZQ//xwZF5xP37/+Rc35l9Dx+jCRK1+GT94Plix4/tvwmznwwvegr2FS5xMdHGTgqadovuEGdiw7iZYbbmT49TewrVhBwR/uxvvOWv5+eTYfHadjSV42FssMdr7/DsXHL0BvNKErLAAg1NwMgNBqMS9eTM53vk35C89T9vR/ybrlFiUy+L7H+OmfRvj6z3byxHXn0fHai8ijE13b325FBuackk91dTW5ubk4RwPBnFkZCFlDX+/uy3rfm29Se+FFdPzvzzAddxxlTz5B7ve/h8bhoMffw9UvXM3z9c9zy8JbuHXZregkHbZM
N2eUDxPWZvDG5mH+9s3radyy/7LVaQWncVHFRfx1y1/Z2r0VyWDknN9fS2a0io3SSjZ8/gZiw5ObrJ0ysZhSVuuqgk/eR9BqY/Pm6zAYcjjuuN8hSYlZghGLybz01630tY9wzrXHTclPrrOXE5Uk8s27AA5qX4zFxfzVV8n54Q8mLeYAlqUnonE40m6XFGfaLSwSVY+j0eopjGQQtMzF1WfjlaEYn2p/jtLizx/wsfqCAvJ/+lPc115L9x/uofcf/6Dv4YdxXnEFmV98CK2/Dt69G979A7xzFxgydneo0ZuVJgZ6M5GgjqFdfoa29TG8qxdiMlqHGccpM7EtnYN57gyEKQN0Wrpb3ualhhc53RSi0H4ynTu24Ovp4pTLFJ+wvlAJZgo3NWOaM76eK4TAUFGBoaIC97XXEOnuxvfaa4SefYwF72+kd+1N9FktWE49na3yxyj02pGMURobGzn99NPH9iNJEnphZmh4gGBdHZ0//wW+119HV1xEwd13Yz3zjLGl57X9tVy35jp6/D389ozfcnbxHlnqHdsQu15Gd+b3uezKj/PMHb/i0Z9+j8UXXMzJl34KjXbfkek3F3+Tta1r+d5b3+OR8x/BaHWz/EcLePYnu1hn+Bj6T32RmX/5PdrMxE1MAvDaz6DqaTjnZ8RKT2Hzhk8RjgyyaNG/0ekcCTvMu0/UUL+5h9Mur6Rw5tQWtJnMRfiNAkd4Cybb6bTs6NvvRGosFKJlTzG/8sopHUtotdhWrmTg6aeJ+f1IphToLpVmH6adoBMYRBv1UTCSzyanmxPXu3lr5hD/2fQAXz+IoMfRFxWR/7P/xf3lL9F99x/ofeAB+h56CNenrsR19a/RrrgNPnpIWYQSHobQCOGufoY+7GKwqh1/izIq1tlkMmeFsOUPYnSFEWIX7HoWdu0+1uP2DCIuBzft6KB06z3APXxtJoi3NsD7FvQzlDpuuPngtkSt243jkktwXHIJb+x6mQceuIXljUaKtvczUiIof/K31L8dxGvQ471gfH04Q+gpfO8Vah/8A5JeT/Y3v4nzqk8j7bF68N22d7nl1VvQa/Tct+o+jnMfN/4E3rkLtCZYfDU5ZhdX/fx3vPq3P7HuyX/TuHkjF37z+9hc45fNZ+gzuHXZrVy35jru3ng3Ny+8mayCpSz5wjreus/K+xkXIF32KSr+8kf0xcWT+vsdlK2Pwxu/hPmfgqVfYUf1DxgY+JDjjrsTm3VmYo4BVL3bxoYXGznuNA9zzyiY8uNNxiL8RglzXw2eSieto3X0vXNdYqEQLTfehO+118j90Q9xXnHFIZ1vxupV9D/yCL433iTjnP3Pf6SZvkw7QX+338EMsZO8gQhrnAZK6mJsm6VnzVAXKza9w7x5E8fPToS+uJj8X/yczC99ie4//IGev/yV3n89iOtTn8L1hauJ9vYy9NLLDL30EoGtVQAYZszAff0KbCtWYKj0Kv98sry7U01oePT7CNHQEI++9wPmEyJYbidWfAsfPvEwFouBWQuXQPcOpA//TIa3kFBT85Reh9MqziZ8zW/5+mtf55O7TiPPJ6g8fyGtTz7Ogq5uhq78FLWVlVjPOhOtK5NTn7gfrX8E+8UXk/21m9G6xwvvYzse46fv/pQSewl3Lb+LfOteI8Whdtj0MCz8nLLKEtAZjay89gZK5y/k6dt/wcbnn+bUKz+3z7meWnAqH6/4OPdtvY/lRcuZmzWXmUuuo7vtBrY8fyEbsj+OdPkVFN/7R0xzpxYQtg9tm+CJ66BgCZz3W5pbH6Sl9UGKi79CTvbHDm/fex6mZoBX/1GFZ4aTUy7zHtI+TKZCfEYNdDWRf7KDXR90Mtjtx561u2yTKDEHMC9ejCYzk8Hnn0sLeooy7QS9SeQRJki+TyYsg2X2fJb2wbOOBh5971fk5t5D9iRa0+2JoawUz69+qYzY77qbnj//mZ777oNRX7hp/nyyv/lNbCvORj9RD0ghlJKM3gyW3UL5ZtNrtIX6
OS/HSNjzMXod5/N6/fOc/cXrYMXHlJWL95xC9txG2psnvxo0zvKi5fzk+J9Tv1ZL3YzNmL/0RZ4NjHBiURELojF8r7xCz71/gliMcGEpr86q4PLrvobWvbu8EZNj3P7h7dy35T5O9pzM/532f1j1E0TpvvdHiEVg2XX73OVdchKOnDz62vbvk4+XXr7/9vd55PxHMGgMLF71fYa6/x916z/L9oLzkK+6isI77sB62mlTfi0AGO5WlvUbHXDZP+j3bWbHjtvIzDyd8rKvHdo+J2CoN8Bz92zC6jSy6trj0BzE0bI/TKZCuowapPAwniJlHy07+scEPRYK0XLDjfhef53cW3+E8/LLD+u8lbLLCgaeeJLYyAiSOd1YI9WYdpOi73Qb+Fz0DBw+RWyNJ5yI690YNo1gs6mBhx56iEAgcEj7NpSX4/nNryl76kmcV15Bzg9/QMXrr1Hy0INkXv2FicX8ADxc/TBuo5OZul7c7uXsfH8tCEH5oqXKBlo9nPcbdHo/FvH+IZ1zdkMlEoI1lsf46otfJRgJUn7KKWR+/nMU//1veN9+i5JHHyHwzdvoc7lobdztRfdH/Hz9ta9z35b7uGzGZfz+rN9PLOZBH6z/C8w6H1xlE56HIy+fvvb9C7pNb+PHJ/2Y2oFa7t54NwBGYz4LL7iMvMrHac1cSnPpmTR95Sv0P/afqb8QkRA88hkY7oTL/0lAL7N5y1cxGj3MmX07QiSmyUMooDhaouEY514377CSPQ2GXAImpeTl1LdisunGFhglWszjZKxajez343v99YTsL83RxbQT9HPn5RNFom1YqWPL5TPRxiSWm/OpiYao9+/i8ccfJxaLHWRP+8fg9ZL73e/iuvJKdDk5h7SPpqEm3m55m+XZxWiEhNt9Bjvff4f8ylnjY3BLTsGvPR5nfity25YpHSMWjbH97VaK5rj5xpk3sXFgI+vy1pFfsLtconU6Mc2dS15RLgAdozG63f5uvvD8F1jTuIZvLf4W3zvxe2j35/zY8A8IDMBJN058P+DMzaO/ve2AXuqTPSdzsfdi7t96P5u6FHdMTvbHmHGOicz8t9mRcz79lYtp+9736P7DH6bmVX/uW9DwNlzwe6K5c9i8+TqiUT/z5t2DTnfg7PjJIsdk1jywnd4WHyuvOQ5XnuWw9ieEhOzwKD/3N5LvddKyo49oMLiHmN+aMDEHMC9aiCbLzeCz6UVGqci0E/QzKrNw4mNzOIwG6DFaceUXMLs5F6OQ6a1ooLq6mjePsN/20R2PIgmJRfouHPaFDPcG6aqvxbtk3xp/sOyzxMIC+cmbphQkVr+5h+GBEHNOzefCsgtZPLCYFmML3137XSKx8TECnuJskKGnu5cdfTu48pkrqRmo4Xdn/o6rZl+1/wYL0Qi8excULoXCxfs9F0euh0goiK+vZ7/bAHxj0TfINmfz/be/TzCqLHKZMeOHFJz1OnZnDR/lXE5k3iy6fncH7T/+MXJ0Egu+1v0ZPrgPTr4Zee4lVO/4EYODHzF79q+wWg6tvj0R7z9dR+2GLk6+xEvxnMS4coRz9IqnvwFPpQNfX5AdN/4/Rcx//GOcl1+WkOOMHU+jIWPlOfjeeGNsNXKa1GHaCbpGIzFT1846wuSGoDEQomLJMtre7eFUm4YPfbV45nh49dVX2blz6tGziSAYDfL4zsc5zbMUXXAn7qzl7Hp/LcCEgq4rnU3nRxlI7evhowcnfZytb7ZgcRgomZtJY2MjRb1FfKbwM7zU8BLffeu7RPdY/Wow6dHIRjb7N/CZ5z5DNBbl/lX3c2bRmQc+yPanoL8RTt7/6BzAkZsHQH972wG3s+qt/HjZj6kbqOOuDXcByqrJufN+Tfbpf8BgGeb9zKswnOSl/6GHab7xJmL+Ayy4qX8Lnvs2eM+B5T+kueUftLU9SmnJDWRnnXPg5zYFdq7rYP2z9cw6OY95Z03d0bI/DLZywjoJua+BvBJlxN9a3aeI+WWXJuw4e5Lx
sdXIwSC+115TZf9pjhzTTtABZtgjBBDY2kdo8AepWLwUOSpztm0mGqAlr4mcnBwee+wxent7D7q/RPNi/Yv0B/tZkaX842e5z2bn+++QXVKOPTt3n+11hYX015qJmMrhxe/DyMHPebDbT+O2XmadnIekkaiqqkKj0XDjyTdy84Kbea7uOX609kfE5N2lpwZnA8/an6DQVsg/z/0nszNnH/ggsgxr7wRXOVSuPuCm8YYYB5oYjXOS5yQu9l7MA9se4KOujwCwZxxP5cwvknvar4npTLxrugTnuTPwvfIKjZ//ApG+CbJO+uqVOFxXGVz8J/oG1rNz509xu5dTWnrgD6Cp0FE/yJq/bSevws7pV8xIaLs4k6kQv0Ei2rWT4Z99B11oiOAZn1RNzAFMJ5yANjubwefSZZdUY9oJ+vBAkGJzPlY5QqTDT4M/RG6ZF6srk1idncWWME/V/ZezL1QWxDz88MOEkhwb+nD1w5RklOCJ7sRsLiUWtNO6Y/uEo3NQsmfQaBmQVoC/H16+9aDH2PZ2KwKYfXI+sixTXV1NWVkZBoOBq+dezXXzr+PJmif56bs/JRKL8Mt1v+Q911vkjuTxwKoHyLXs+8GyDw1rofVDOOl6kA78VrG53Wi0WvoPMDG6J99Y9A1yzDl8/63vE4gok9jFxV8iu7CY/FPvYUDO593wGeRetYDAtm00XPkpQs0tu3cQ9MGDVyoZPJc/SIBhNm+5HpOpmDmzf40QiXlr+/qCPPuHTZhtelZ/aS4abWL/ZUymIkZ0Ei0P1jLy+hvkevR0heyqZt0IScK26hyG33iDqG/f6OY005dpJ+hbXm9hZFMOn/XpcHSG6QmEGInJVCxeSu3bXSy3a4nEIjzb/iwXX3wxHR0d/Pe//01aGFRVbxUfdX3EJd4L6e9/D7d7ObvWKX09vSdO3E5OaLXo8vMJtIVh6Vfgwweg8b39HiMajbH97TaKjsvE5jLS0dFBf3//WBgXwJfnfZkvzv0ij+54lPMfP5+/b/s7iyLLWNa5FBGe5J997Z1gzoTjD+59liQN9uzcg5Zc4lj1Vn580o+pH6znro1K6UUIDXNm/xprTgMlp75CS2geG7qKKLzuNCI9PdRfcTkj69crcQdPfBm6tsMn/0rUWcCmzV8mFgsxb+49aLW2yT2/gxAJRXnunk2EA1HO/eo8TLbEt3AzSrkMPG9npAlyf3wrpcvn4usNMtRzaE6tyZKxejVyOIzvlVdUPU6a5DLtBH3BqmIKludhk6Os9un56jMDvP1SAyXHLyXsD1NoOI75Fg0PVz1MXnEeZ555Jps3b+b99w/NFjhVHq5+GKPGyMlOJ7Icxu0+m53vr8WZX4DLU7jfx+kLC5TFRWf8P8jwwNNfg+jE4VX1m7oZGQwx51TFIVFVpSx6iodxgRIbcOMJN/K5OZ+jbbiN7574XS53XoVA0HKQGF1AaeG34zlYci3oJrdM/GDWxb1Zlr+MT1Z+kge2PsDGzo2AYmWcOeOn6HMepnxZG9v9Z1O1y0fJN89FaHU0fPoqqhcuoO5379DafCa977ZT9Z+v4OvcxnFzfovFMrGtcqrIsswrf9tOZ+MQK74wm0zPBHbOwyQWDNL77duJ1enIXdyPc9XJ5HsdAFOO050qpuOPR5uXl3a7pBjTTtB1eg1LVpawzPEXnjIHGZZkdj5Zzyv/GMKYcSZ9O3M40zrMUHiIR3c8yqmnnsqMGTN44YUXaGiYXODWoTIUGuKZ2mdYVbqK4MA7aLUODJoKmrZuwrtk2QFrr7qCQsLNzWCwwupfQudWJVNmAra+2YrVaaB4jmJ/rKqqorCwEKt1vOgIIfj6oq+z9oq1XDHzCnLylUVP7S0Tx++O453fg9YIi784yWc/Oevi3nx90dfJs+Txg7d/MFZ6yck5l7zci9EW3ErxPA3v+q6i6YO1lP74cvJvugxn+QAaVza+Le10/PR/kH7wLnlf19F/xc9pvuEGuu66i6E1awi3tBzy
ldkHz9Wzc30nyy4qp/T4rEPax4GIBYM0X38DI2+tJfoJgbN8BPoacOVZMFp1h9TwYioISSLjnHPwvf020cFBVY+VJnlMO0GPxCK4LTr6tWGsBsE/7SE0nymlcKYTNPOpX3ceuo2fZal0HH/f9ncicoSPf/zjOBwOHnnkEQZVfPP+t+a/+CN+Lq28hJ6e13C7z6Duww+RYzG8SyYut8TRFRYQ7etTapozz1UmIV/7OfSPz3gZ6PLTtK2XWSfnI2kk+vv7aW9vH1du2RuzTlkR6ClW6uZdHd0HfiK+TiXLZv6V41a+HozJWhf3xKKz8OOTldLLnRvuHLu9svKHmM0F2I+7lZwSMy8P3kL3C3/G3nMPOedWUvT4q2Q+/Us6fhYl8v/m4b75JozHzSG4cxfdv7+L5q9ez67lZ7PjxKU0XPUZ2v/3f+l/7D8Etm07aCu2mg2dvPdUHZUn5nDCyqktJosTCwYJt7Tg37SJoVdepe/RR+m+54+0/8//0nLLLdRfcgnDb75J3k9/QmzlaIZNXz1CEni8joQ0jj4YGR9bDeEwQy+vUf1YaZLDtFv6/0ztM9zz0T3MdcGZ7TE+GJHZoony5S/N5aM1b/Pa399lqGkx8+tPwmXfzr+dz3HF2Rdw2WWX8ec//5lHHnmEz33uc2i1iX3qsizzSPUjzMmcQ4EuRGe4D7d7Oe88sxZbZhY5ZRX7POaf7zUQjcl8ZlnJ7tTF5mY0M2fCx34Jd52odNq5/J9jj9n2VitCwOyTFZtgdfX47PMDkZljV2J0J3KM7Mn790I0BEu/OtmnD4y3Lu4d0nUgluYt5dLKS/n7tr9zdvHZnJB9AlqtlTmzf8sHH15K6fJ/43/i4zw78AMutv0K00V/Y7i/nQ0f/AitZSHZC35HTDYiLY+iC8fA5yfQ2EKguY1ASwehji6Cb/UQe/1topr3iWn0CEcmOFxgc4DFhmy0EENDJBSlp3WYnNIMzvz0zLGrKlmWiQ0PE+3pIdLTQ6S7W/m5u4dIbw/R7tHbe7qJdvfsNxZYslrRZmaicbvJ/9WvsJ9/Hi2b30BmLaJfuYLMr3RSs6GLwW7/hA2mE4Vx7lx0Hg+Dzz+H4xMfV+04iaDvwQeRQyGcV16J0CWu/2uqMe0EPceSQ6Ypk+ecH/HT3j4IOdi4qxNOrGD2KQt57f7fUH5yPX29ENyxkr7HLDz83vuccHYR5593Af95/DFeeOEFzj333ISe1wcdH1AzUMNtJ91Gd/cahNBhMy2i/qM/cfzZq/cptwTCUX7+XBUui57PLCtBV6AIeqipCePMmeAogtO/pTheqp+DGauJRmJsX9tK8Vw3VqcRUMotbrcbt/vgAipJEjphZtB3gKuU0LCyUGfmueDe90PoQOxpXSycPbWQrVsW3cLbrW/zg7d/wKPnP4pJa8Jun09p6U3U1v6GEy8/hdf/6uQf9bfCD6pHH/U9ALayvxW2GcqX2QujA21JkpGIoYmGEMNBpIEhpFgfUjSERiuhNRsosOqYM/gmbTf+lUhPD9HubiI9Pcj76fajcTjQuDPRZroxzTkOTWYm2sxMtO5M5We3WxHxzEwkg2GfxxstpQQNEoa+OgTgqXQASq6LmoIuhCBj9Sp67n+AaH8/GodDtWMdDsGaGtp/8lOIxeh//AnyfvITTHOPO/gDj0GmnaAvzVvK0ryl/N8//xerth2zNpOm+j6+99b3+Mzsz1By/Ak0rttC6fnrCC7U89RbA5wzcAUv378di8PAzIKTWf/eu3g8HubPn5+w83q4+mFsehurSlexcf3dOJ1LadpSTTQcnrDc8lp1J0OBCMPBCMFIFP1oo4vwnqmLy66Hjx6GZ78FpadRt3kY/1CYOacqwun3+6mvr+fkk0+e9Hma9VZGDtSVaeO/wN+3336hB2LMutgxOafLnlh0Fm476TaufvFq7txwJ99a/C0ASoq/TG/vW7T2/pDVX/03bdU6unr+y4h/M0XFV2J3zkKrk9CMfmm1
Elq9Bo129HfdHt+1EkIa/8Ea6e0lWFVFYHsVgeoqgturCDU2Es6wIWcqQqwvKUab6d4t0PGfXZloXc7DHjGaTIX4jRK63p1oQKmjW3S07uhj1kl5h7Xvg2FbtZqeP/+FoZdfxnHJJaoe61DpuvP3SEYjOd/7Hl233079ZZfhuuoqsm68AclyePELqca0E/Q4lZ4zWLTtQUqdC9jaPcJzu97lqZqnWGmaR36rD5N+JpWaHfjLY7yi+zM/K7mTjS830bIliFtaxpp/bMGqd1Ix+/AzuLv93bzc+DJXzLwCOdTOyEgtBQVX8dG/12LKsJM/c9Y+j3lig+IGicnQ2DOCN8eOlJExPhddo4PzfgP3rYbXf8nWrRdhdRkoGl12vmPHDmRZPmD9fG9s1gwGejqIxWJIe3vLY1FlMrRgsdJQe4qMWRcnsbhoIpbkLeGyGZfxj23/YHnRchbmLByzMr73/rm09f8/suYsZ7D2d8wu/xYlxQdZ5ToJtC4X2pNOwnLSgec41ETJRdeQ0dcIgJAE+ZUOWnb2q35s45zZ6AoLGXzu+aNS0APbtjH0/PO4r/sKjos/gW3lCjp/8xt6H3iAoZdeIvfHt2I99dQjfZpHDZOaFBVCrBJCVAshdgkhvjPB/UVCiFeFEBuEEJuEEIkLnt4PWZ5yEB0sGO2S8+mSO7h5wc1stbcTEzIfVPcy7NvGZd7z2NK7hQ53DRd97QQu/e5iyuZnYfDl8fwdu3j2jx/R2XB4E6X/2fkfIrEIl1ZeSle3MsHkzDiV2g3rqVi8FEkan/Q3MBLmlapOFhUrLeJqupTFHfqCgn1z0YtPghM+Tf+b/6a5SuloI42OMqurq7FareTnT9zlZiKcTheyiNHT1b/vnVVPK6svT7pBiQQ+BKZqXdybWxbeQr41nx++/UP8EWXJf9zKODi4kZraX5Od/TGKi6495GMcbZhMhQSMGjTDPRBWnD6eSgdDPYFJ9Rk9HJSyy2qG332XyBFYVX0wun53B5LdjuvzSvMajc1G3o9+RPG//okwmWi65lpavvFNIj2Tn4hPZQ4q6ELJHb0LWA3MBq4QQuy9Zvz7wCOyLJ8AXA7cnegT3ZvSbDs9BDguLIiZtbxePcDVc6/mv1c8j7WsgMAWZXHJxtp7MWlN/HHTHwHIKrKx+trjOefGcgLWVuo2dfHoz9bz+K8/pH5TN3Jsaja3aCzKozseZWneUkrsJXR3r8FqnUXHzi7CAf+E5ZbntrQRisa4ZUUlADVdygSarrCQcNMEnYvOvo1twdUIYsxepjhVwuEwO3fuZMaMGfuOtA9AVo4yum+t7xx/hyzD23eAswRmnjfp/e3NoVgX98SsM/OTk39C41Ajd3x4x9jtOTnnUljwORyOJcye9YuELr8/0hgMOWMxugwof39PpfJh35qEUXrG6lUQjTL00suqH2sqjHy4Ad/rr5N59dVobOMXi5kXLKD08f/gvv56hl54gdqPnUv/fx5P2gLCo5XJKMESYJcsy7WyLIeAh4AL99pGRpmBArADhz5EmyQ5GQZaJCjwy8SyjWxp6GcwEEan0bH09PORWvWgyWOJVYM/4mdd+zpuefUW6gfqAfDOLuGsK+fQnfkOWfNiDHb7eebuTTx423tsfbOFSHgSKX/AG81v0D7czmUzLiMc7qO/fz1u91nsfH8tepOZouPm7fOYJza2UOq2sKw8k5wMA7Wjgq4vLFC803slDEYNTqqC51BieB9Lw2MA1NXVEQ6Hp1RuAcgtUDzVHa17edGb3oOW9UrdXjr07PBDsS7uzeLcxVwx8wr+uf2ffNDxwdjtlZU/YOGCB9FoUqsxgxASMfvoVdZoc/J4HV3tBUYAhpkz0RcXM/j80bPISJZlum6/HY3bjevTEzfDlvR6sq7/KqVPPI6+vJy2736Xxi98gZDK602OZiYj6B5gz2Fj8+hte3Ir8GkhRDPwLDD1GbUpIoSg22Anzx8klm0kGpN5rVoRqYrRBhK64Vlk
iQH+tuJutJKWlxtf5oInLuCGV25gXfs6TjjhBBYsns+2zrdYeKWDFV+YjUYn8do/q/nbd9fSsOXgovTwjofJNmVzRuEZdPe8DsTIdJ5Jzfr3KF+4ZJ/Gya39ft6r6+XC+fkIISjPso6VXHQFhcjhMJHO8aPn2o1d+AMa5hQ3wYs/gOEeqqqq0Ov1lJaWTul1KyhRRvjd3Xs9t7V3gsmpeM8Pg8mmLh6MmxfcjMfq4Qdv/4CR8Mhh7Ws6IJyjf8f+euV3SZCfJD+6EALbx1Yz8t77RLoPskYhSYy88w4j77+P+9prD9pZyVBeTvE//k7urbcS2LyF2gsupPvePyGHJ15pncokamHRFcD9siwXAB8D/i4mSEcSQlwrhFgvhFjf1TWJ1YoHIWgvxkwbbqsOvVHDS9uUJe22TDe55V7aPooCMXJoH8v8vmzGZWzs3MgXXvgCVzxzBWKmID8/n6f++yTOUg2XfncxF37tBMx2A8/es4mGrfsX9aZBpYnFJZWXoJW0dHevQa/PZqAZAr6hCcstT33UiizDRfOVz8SyLAu1XT5kWUY36nQJ7VV22fpmK7ZMI0VXXA/BQeSXfkh1dTVer3fKfnqT2YAmZmBgsH/3jd27oOoZZVWo/vBcA1NJXTwQZp2Z206+jaahJu7YcMfBHzDN0Tq9xCSQ+3aPLvPjdfQedevooHQyIhZj6KWXVD/WwZBlmc7bf4c2Lw/HJPPghSThvPwyyp55Butpp9H1m99Q98lL8W/erPLZHl1MRtBbgD1DSApGb9uTq4FHAGRZfgcwAvsYo2VZvleW5UWyLC/Kyjr85dQisxytaKfQH8OaZ+HVqk6CEaVcUbF4Gc0bO9Dr8ujseoGrZl2FRmiIyTFevORFfrD0BwyHh/l/a/8fT9meIiIiPPjQgwSDQQpmOLnoayeQmW/luT9spnE/ov7ojkfRCA2f8H6CWCxET88buN1nsmvdu2j1BkqOX7DPY57Y0ML8QgclbkU4y9xWBgMRun2h3YuL9pgY7e8YoaW6j9mn5CPy5sCyryI2/oPM4R1TLrfEMWgt+Eb2mAh+9y7FUbPk8CcaD8e6uDeLcxdz5cwr+ef2f7Kufd1h7+9oxmQqxm/QIPfuzvBPZh3dUOlFX17O4HPPq36sg+F79VUCmzaR9dXrkPRTC0TT5WRTcOcdFPz+TqK9vdRfdjkdP/vZfhd6pRqTEfR1gFcIUSqE0KNMej611zaNwHIAIcQsFEE//CH4QbDkVqIVbeT7YoSyjPiCEd6tVWbqK5YsAwQiUElv79s49SYuKL+AJ3Y9wXB4mEtnXMqTFz3JnWfdSZYzi1edr9Ld082v7vsVrUOtGC06LrhpPs48M8/+YTON28aLejAa5PFdj3NW0VnkWHLo63+faNSH23UWu95/h5LjF6AzGsc9prp9iKr2IS6av9uVUp6t5K/UdvnQ5eWBJBHaw7q49a1WJEns9iOf/m38BjfnsYaK0kOzXFpNNgKR0Tf4cLfiPT/+crBOrbn2RByudXFvblpwE4W2Qn749g9TuvQS96LLvTVjt2XmWzBYtEkru2SsWsXIunWE9yr5JRM5FqPrd3egLy7GftFFh7wf29lnU/bM0zguu5TeB/5GzfnnHxN9VA8q6LIsR4DrgReA7Shulq1CiNuEEBeMbvZ14BohxEfAg8Dn5CRMN7sLKxCiA8+wTHeGFrNew0vb2gHI9BTiyi+ga5sGWQ7R0/M6nz/u84RjYf65XVlKLwmJMwrP4L5V9/H7S35PpCJCtCPKTffdxLde/xYhnZ8LbzoBR64i6k3bd9u64k0sLp2hNCLo7n4ZSTIS7MvC19c7YVTuExtb0EiC847fLehloyP12u5hhE6HLi9vbIQeDceoeqeNkuPdWOyjKwz1Fl7WryKbHkwb/3pIr1tGhoOoCBIMhJRVoZEALEvctIcjN2/SuegHw6wzc9tJt9Hsa+b2D29PyD6PRkymIgJGDWJg98WvkuvipDUJE6Mw
6naRZYZePHJll8HnniNYXY37hhsQhxnPsafFUTKbafrSl2m55etHzTyBGkyqhi7L8rOyLFfKslwuy/L/jN72Q1mWnxr9eZssyyfLsny8LMvzZVl+Uc2TjlOabaePETz+GGgEC8tcvLStg9io9bBiyTLq3mtHq3XS1fUixRnFnF18Ng9XPYwvND7Yf457Dv/zqf+hfEY5c3rn8MH2D3hsx2MYrTouvHk+jmwzz9y9iaYqRdTjTSxOzD0RWZbp7lqDy3UKNes+QNJoKFswvv9mLCbz5IYWTvW6cVt3L//2OEwYtBI1naMTo3tYF2s2dhLw7V4ZCtDV1cUHQ1n05pyshHf1TX1G3+1WUhpbaxqU3JbK1ZBVOeX97A9nXj59HYduXdybRbmL+PSsT/Ng1YMpW3qJj9Cl4LDSkHuUfK+Dwe4AQ73q5qMDGCoqMHi9R6yTkRyJ0H3HnRgqK5XgsARhXrCA0v/8B/cN1zP00kvUnHse/Y/9JyUtjtMubXFPsm0GWgXk+5U2a5VlTjoGg2xuUf4hvIuXIUdldNE5dPe8RiwW5OrjrmYoPMS/d/x7n/0JIbjs4svIzs5maddSdrXtAsBk1XPh1+bjyDbx7F2bWLvuIz7q+ohLZ1yKEALfcDWBYCvuTKXcUnTc8Rgt46Ns19X30joQGJsMjSNJglK3hdru3dbFULMyQt/6RisZbiOFM11j28ezz7Xn/Z+y+Oe5b0/9dctT5i+iG/4JIz2HtMz/QDhyPUSCh2dd3JsbF9xIka2IX677ZcL2eTSh0ZiJWB3KL3t8SHtmKLcla5RuW70K/wcfEO6YRGZ+ghl44glCDQ1k3XQjYgprKyaDpNeT9VXF4mioqKDte9+j8fOpZ3Gc1oIuhKDbaMEzonS4d+Vb0UiCF0fLLjllFVhdmfTuNBKN+ujte4c57jmcmHcif9v2N0LRfWNU9Xo9l112GRpZQ3BLcOxT3GTVc+HNJ5CRZeLD+zspGZrNBeVKxam7S1mQIQIV9He0TehueWJjKyadhhWzc/a5rzzLSu0e1sVoTw89td207uxXJkP3yB+prq4mPz+fjMLZSjOMHc8pDpUp4CnKQSCT2/AQ5C9QVqMmkERZF/fEpDWxqnQVO/p2TPh3SwVkx6j3oK9+7LbMfCsGc3Lq6DDqdgGGXnghKceLEwuF6Lr7bozz5mE96yzVjmMoL6f4739TLI5bRi2Of7w3ZSyO01rQAQL2IrLDnRhj0EmME0tdvLhVGV0ISVJa063tRCOZ6epSKkFfOO4LdPm7eLr26Qn3mZmZia5Sh3XAOhZPC2Cy6Vn+VS8Dhm7O2f5FfA3KlUF39xoyMuZT/2EVCEH5ovE5KMFIlGc3t3HOnBwshn3rguVZFhp7R8aFdG15uXZ0MnR3uWVoaIjm5ubdUblLvwLZc5TwruDke0Nm5TmplOuwhtsOa5n//kiUdXFvKhwVxOQYdQN1Cd3v0YJweZUf+nePGnf70ZMzQjeUlWKYOTPpnYz6H36ESGubMjpXeRXwOIvj6afT9dvfUnfJJ/Fv2qTqcZPBtBd0kVmGTrThCUZp9IdYMTuHnZ0+6kZLGBWLlxH2hzFIc+nqehlZjrIsbxmzXLO4b8t9RGMTrwitPL6SAd0ATz/79Lgm0y93Ps+Ts+7E4tLz9O8/on5rHYNDm8hyL2fn++/gmTEbi8M5bl+vVXcx4A9z4Ql7r8dSKMuyjoV06QoLiUpadmwZonR+FuaM3bat+IfLmF1Ro4PzfguDzfD6zyf9mkkaiVPEhwzghFkXHPwBUySR1sU9KXeUA1DTX3OQLacnensFYY1A7qsdd7un0pm0OjpAxqpV+DduJNyq+oJvAGIjI3T/8Y+YlyxJakiaLiebgjt+R8Fdvyfa30/9ZZfT+ZvfJu34ajDtBd2ct9u62OAPjpU04m6XglnHYbRYGWywEw73MDCwASEEX5j7BeoH63m16dUJ91vhqmBj5kZ8gz7eeustQFnw8HD1w5TnFXPp
N5Zicxl5/p4aRroq0MvH0d1YP2G55cmNLWRa9JxaMXFmeVmW4nSp6RpGV1BAl/sEQiHBnNPGh25VVVXhdDrJzt7DXlh0Iiz4LLxzN7TvLxd8L5rep5Am3mcJaBIfuJlo62KckowSNEJDzUBqCrrZVETAKBHr2Tnu9vzRfPRk+NFh1O0CDL6QFG8Dvf/8J9HubrJuvvmIZPTYli+n7Jmnsa06h5577z0i8weJYtoLepanAol2PCMyjYEQHoeJ2XkZY2UXjVZL2cIl1L7VjRC6sbLLiqIVFNoK+cvmv0w4211uL6fb1I2lyMLbb79NT08P6zvWUztQy6UzLsViN3Dh105Ab/HR/ObNbHtdmQD0Llk2bj+DgTAvb+/kvHl5aDUTv9xlWcoEak2XD43DQWvhaVg0fgoqd4/0g8EgdXV1zJw5c983/dm3gskBz9wCsdjBX7S1dxIUZtbJM4hNZvtDIJHWxTh6jZ5CW2HKjtCNJiVGV96jhg7g9sTr6Mkpu+iLizHOnp0Ut0t0aIieP/8Fy+mnYV5wgurH2x8aqxXnFVcAENy564idx+Ey7QW9JMdJvxjBMxLDF43RG46yck4OHzT20e1TOsxULFnGSL8fk+44OrteRJZlNJKGz835HFt6tkxohXMYHbiMLoZKhtBqtTz77LM8XPUwGfoMVpUqIxijNUbB6T/HaItS/a6FzKLFZGSNX5zz/JZ2QpEYF+2n3AJgNWjHQrr62kbot5VRFNkxbjJ0165dRKPRiVeHml2w8qdKwNaGvx/4Beuthe3/pc6+gpDQ0N+jTo/VRFsX41Q4KlJW0E2mQvwmDZrBdiX9cpRk5rrEsa1eRWDTJkLNey8KTyy9991PbGCA7JtuUvU4k8HgVeYwgjt3HmTLo5dpL+hZNgNtIqZ40YGGQJCVs3ORZVizXRmll8w7Aa3ewEhrFoFAEz6fYv27sOJCMo2Z/GXLXybcd7mjnBp/DWeeeSZb67fycuPLXFhxISat0hast/dtNIYuTrlSIhYdwj9yMu21A+P28cSGFoozzcwvdBzweZS5lZCurW+1IIiR2/L2uPurqqowm80UFhZOvIPjr4DiU+ClHyqrP/fHO3eDpKW/TAnhamlQ5/LSkZNPJBhkuC+xGdtljjIahxpT0uli0GcTNOkR0TD4xv9dPJVOBrv8yaujrx51u6iYwBjp7aX3/vuxnXMOxtl7J3InH63TicbtJrgrLehHDCEE3SYTnhFlRNPoDzErz4bHYRoL69IZjJQcfwJ1a/sBMVZ2MWgMfHr2p1nbupbtPdv32Xe5vZyagRoWL15MV24XUTnKx0t3N9Pt7l6DRmNlqCVIaOhRzHYD/71jI+11iqi3DwR4p7aHC+d7DlobLM+20NDpo/qddjyWfqTGncij5ZBoNMqOHTuorKzcf/a5EEp3o9Cwksg4ESO9sOEfMO8yMsuUkX57izoJDY68UadLgssuqex0EUIimqGkYe69YCzf6wCSV0fXFxRgnDtX1WyXnj/9mVggQNaNqoezThpDRUW65HKk8Wd4KPArNewGfwghBCvn5PDGzm6Gg4pHvWLxMgbahjAb5tDVvXuy59IZl2LRWbhvy3377LfcUc5weJh2fzt1tjqy/dk0bFT+0WQ5RnfPK2Rmns6u99fhyndy8TcXY7Tp+e/vNtJRN8h/x5IVD95RqMxtJW9IJuiPMKMc5FCIyGgiZX19PcFg8OBhXFkzFBviR/+C+rf2vX/dXyDih5OuJ79YmTzeJ0Y3QahlXSyzlwFQO1B7kC2nKY7RfJ7+8YKeWaDU0ZO1wAiUUXpg61ZCjY0J33e4o5O+f/0L+/nnYygvT/j+DxWD10tw166xwdR0IyUEXWSWY5NbyQzHaAwodfMVs3MIRWK8uVMRxbKFSxCSRLA7H5+vCr9feZNm6DO4dMalvNDwAk2D42Nr4za5p3Y9RVewi7NcZ7F27Vq6uroYHNxEKNRNhmUZzdu24F1yElankYu+dgJGq46n
7tjIq+80cXyBfWzS80CUZ1s5PqjB4DTgOU4R23gEQFVVFVqtlrKysoO/GKd9ExxF8PQtENmjLBEOwPt/hIoVkD0Li82EFNMz0N9/8H0eAmpZF0vtpWiEhl3903cUdSA0mcoag70nRiVJkFeR3Dp6xjkrAVQZpXff8wfkWAz3DdcnfN+Hg8FbgTwyQrg1se/bZJESgh5PXfSMxGjwKyK2pMSF3aTjxdGyi8lqo3D2XBrfUxL7Ort2j9Lj0boPbHtg3H7jo8Fn654l25zNV1d9FZ1Ox7PPPktX9xqE0NBfZ0SWY2PuFpvLyEW3LEBj1LCwPsL5hRNbFffGHRUURDVIFVYMRUqdPNTUjCzLVFdXU15ejn4yUaJ6M3zs19BdDe/cufv2TQ/DcNe4Zf5GrQWff2hS5zdV1LIuprzTxVZOUC8h9+zY5z5PpYOBLj++vuTU0XUeD6bjj2fw+cQKeqi5mf5H/43jkovRFxQkdN+Hi6EiPjG67+s/HUgJQXcXepFEO55hmQa/MkLXaiSWz8rmlapOIlHl8qliyTK6anowGSro6tq9tDnLnMUF5Rfw+M7H6fbvnlDMNGWSoc+gfrCeS7yX4MhwsHz5curq6mhpfga7fRG16zaRkZVNdunuy0aby0jvEjtBIRN7rZOuxoOLZs9HPUSRaXdq0OXngxCEm5poa2tjcHBwatnnlSuVBUOv/1JZRh6LwTu/h9x5UHra2GYWo41AePIrTKeKGtZFGJ2sTlFBNxmVkK5Y775XIPF89KSO0j+2muD27QTrEjdn0f37uxAaDe4vfyVh+0wUBm8FMH2tiykh6MU5Tobw4fHHaAmGCY+mLa6cnUP/SJh19UrdMd6aLjpQwsDABoLB3ROCn5vzOcKxMP/a/q9x+447Wi6uvBiARYsWUVBgIhJtwGZZRsOmDXiXLBs36SnLMk/s6KT2OAsGk4Ynb99wQFEPh6JUv9dBW4bErkE/Qq9Hm5dLqLmJqqoqhBBUVk4xDXHVz0HSwrPfhJ0vQPcOOOnGccv8MzIcRAgQCqmTY6GWdbHcUU7TUFNKOl2U1EUNYmDfZuGZBVb0puTW0W3nnAPAUIJG6cGaGgaeegrnlVeiyzn8/P1Eo7HZ0OblTVunS0oIepbVQLsUwzMSIwa0BJV/9NMqszBopbGwrnhrutYNEUCmq3t37nOJvYSzi8/moaqHxqJ1A5EAfYE+tEJLlklJKJQkiSUnKo0rPno/QDQSoWKv1aEfNPTR3OfnY0sL+fgtC9AZNDz5uw10N08s6rvWdxLyR4gUW8ZCuvQFhYSbmqmqqqKoqAiLZYqt4eweOPO7sPNF+O9NkFEAcy4at0lmpgsEtDao09BALetihaOCqBxNSaeLyVRIwCgh+XogOv6DVor70ZPkdAHQ5eZiWrAgYXX0rjvuRDIaybz2moTsTw2ms9MlJQRdSV00kO9XRoLxOrpZr+WUCjcvbu0YGyVWLF5G86Z2DPqCMftinL2jdV9seJFQLEREjtDl3z2aj0Y+IBbNZmvNMPrMbPIrx5dDHt/QglEnsXJOLhluExfdsgCdXsOTv91Id/O+JY5tb7XgzDWTW5FBU5+fYCSKrrCA3u4uOjs7D7nVHEu+BDlzFU/z0q8o2S97kJ2r1PfbmlQSdJWsi6nsdNFozIQtDoQsw0DzPvd7Kh0MdPrx9QWTdk4Zq1cT3LGDYM3hlbkC27Yx9MILuD73WbRO58EfcIQweL2EamqQoxPnPB3NpISgA4xk5FIwOsEXd7oArJyTQ0u/n+1tyn1jren8lfT1vUsksnvUHI/W/fu2vxOKhni4+mFyzYovOF6zjUSG6Ot/j/z8VYhohHBBOXv2ww5FYjyzuY0Vs3OxjiYr2rNMXHTLCWh0Ek/evoGelt2i3tPio712kNmn5FOebSMak2nsGUFfWEijSel2PpauOFU0Wvj4PTD/07Dws/vcHbcudnWo08HFqUKMLqS+00V2jK4q3svpAnv2GU1i2WXlShDi
sEfpnb/7HZLdjuvzn0/QmamDwetFDoVUsWuqTcoIOu5y8oIt6GLy2AgdYPmsHIRgrOwSb03XuVVClsN0d48P5/rCcV+g09/Jr9b9ik1dm8Zq5/HRYE/PG8hyBHmoFH1nM4OhMJv36Cz+xo4u+kfCfPyE8d5ze5ZZEXWN4Inf7hb1rW+0oNFKzFyWt1dIVyEtBR6yHA5cLheHTO5xcNFdYLDtc1eOJxNkQW9vYksicWzuLDRabcJH6KnudBHO0Qn2/n2bL8Tr6MmcGNXlZGNetIjBw1g1OvLhBoZff4PML16Nxrbve/FowlARnxidfnX0lBF0c64Xg2gjPxAbJ+huq4GFRc6xsC6It6ZrRadz71N2iUfrPlT9EEaNkStmXkGGPmNMPLq716DTuWj+sAtr2E9eXh4vvvgigYBiJXtiYwsui55TvVn7nKMj26xYGjWCJ2/fQEf9INXvtVO+MAujRUepOy7oPiLZWXS73ZRnZCT8tYqj0UjoMDM4pE6ei1rWRUhtp4vWNZOYALl335KSJAnyK+xJC+qKY1u9itCuGgI7pm7nk2WZrttvR+N24/rUp1Q4u8RiKC8DIdKCfiTJKpyBRrSTPxIbsy7GWTknh21tgzT3KR507+JlyDEZbWQ2Pb2vE43u9vXGo3UBVpeuxm6wj4lHLBahu+c1XK7TqVm/jooFSzjvvPPw+Xy89tprDAXCvLStg3Pn5qHbT7KiI0cRdSEJ/vPLDwgFosw5VbnEthl1YyFd9aEQsiRRonIdz6S3MhJUx4sO6loXU9bpYikmYJCI9lRPeH9+pZOBTj/D/Umso69cCZJ0SG6XkXfeYeT993F/6UtIZrMKZ5dYJLMZXWEhwV3Tr6SXMoJenOPCJ4bwjMj7CPqK2UodPJ7tklPuVVrT7TASjY7Q2zc+CGtF0QpuPOFGrpt/HaCIx67+XfT3rycSGUCMVBAY9lFx4kl4PB4WLlzIe++9xyNrdxCMxLjohAMv9XfkmMdWlGZ6rOSV28fuK3Nbqe32sbO5GbPfj71DnQnLODZLBsHYsGoNc9VMXUxVp4vRVETAqEHum/i5eUbz0VuSWEfXut2Ylyxh8NnnpvS3lGWZztt/hzY/D8dll6p4holFcbqkR+hHDLdVT6eI4PHHGIjGGAhHxu4rdVvwZlvHBF0IMdqarh2NxrZP2UUjabhm3jXkWpQPgnJ7OYOhQZo7nkEIPW2bAmgNBkrmKfnNy5cvx2g08s+3qil0mlhQdPAZfGeuhStvPZELvzZ/nIe9PNtCTaePnTt3UejzEWne1+mQSJwOJ7KI0t+jzihdtdTFFHa6mE1F+I0S0sDE0bXuQht6oyapdXRQOhmF6usJVk985TARvldfJbBpE1nXXYc0mZXORwkGr5dQfQNyaHpdAaaMoCvWRf3u1MXA+D/Eyjk5vFfXS/+IcnvF4mWEA2EMYi7d3WuIxSL77DNOmaMMkOnufgWncym73v+A0vkL0RkUP7rZbGbhKWdRN6JnaZ520l1XDGYdJuv4N3mZ28pgIIIvAqV6PaHmfReYJBJ3diYArY0qxeiqZF0stZciCSklnS56fRYBkwFNwDdhr1hJEuR5HbQmWdBtK1eARjNpt4sci9F1++/QFxdjv+gidU8uwRi8XohECNbXH+lTmRIpI+gAvowsPH6lHr7nxCgoZZdoTOaVKqWEEW9NN1BvIxzuY2Bg/X73W24vJ1srI4fb0UXnMNzft0+rubqoCxmBrnUDfr//kJ9D3Oni19ooys4hPJrnoha5HmXyVq0YXbWsi3qNniJbUUpOjCoxuqOrKPsnts55vE76O0YYHkheHV3rcmE58UQGn5tc2WXw2ecI7tiB+8YbENrEtzpUk90RANOr7JJSgi7cZRQFFOFo2GuEPs9jJyfDMFZ22d2argsh9OPCuvYm25zNQqvyhuyq1iBptJQtWDxum6c+amNGlglDsJ9XX524T+lkKHMrk0aGrGKMhYXIweBYjK4aeEqUslJ3
lzoxujZ3FpIm8dZFSG2nC44i5fsE1kUAzwwHQPJH6atXEW5sJLBt2wG3kyMRuu+8E8OMGWPNMqYT+tJS0GjSgn4kMedW4oy2YA/v63SRJMGK2Tm8vqOLQFhxjlQsWYZ/wI9ZdzzdXS/td9QhhOB4i6A3ZqHmnS0Uzz0eg3n3UvyaLh+bWwb45JISFi1axLp162hrO7QRaXSoGw0xIuZM9IVKEl1YxTq6zW5Giuno71dngk2SNNhzctNOlykiZSrZPXLvxBOj7gLraB09yfbFs88GrfagbpeBJ54g1NBA1k03IvbXlOUoRtLr0ZeUTDuny/R7pQ+Au3AG2lHrYqN/33/yFbNzGQlFWVujrIyMt6Ybbs0kEGxlaGjLhPsNhXrIkXzs6NMw0NmxT3bLkxtakARccHw+Z511FiaTiWeeeeaQGjDvqK4mQwTpi+jRjbabU1PQAQwaC74R9ayLztw8dbzo9nKicpT6wfqE7/tIo3fMICJBrHdi37ekkchLcp9RUNq0WZYtO6DbJRYK0XX33RjnzcN65plJPb9EMh2dLikl6MW5mfjEIAUjMg0j+9YWl5VlYjNoxxYZKa3pFlD7dh8CzT5ulzg9Pa8hkOmq0SsOmUUnjt0nyzJPbGzlpHI32RlGTCYTK1eupLm5mY8++mhK5y/LMlVVVXhsGup7RtB5FH96qEndiVGL0YZfxRhdNVMXgZQsu5jMxQSMmv0KOiht6ZJdRwfF7RJuaSGwZeIBUP/DjxBpbSP75psmbRA4GjF4vYQbm4gdxpxYskkpQc+06OkWYTz+GM3BENG9BESvlTh9RhYvb+8gGouHdS1lsH0Qs+G4cemLe9LV/QponVh3ZpBRXoTZ7hi778PGfhp7R7joBM/YbfPmzaOwsJCXXnqJkZGRSZ9/Z2cnfX19zMh30tTnJ6zRos1RJkbVRInR9RMO79/pczioZV1MZaeLkouugb7954nsznXpT9JZKdjOXg46HYPP7hsFEBsZofuPf8S8ZAnmZcuSel6JxuD1giwTrJ0+1tiUEnQhBL0GHR6/TBhoC+6b871yTi7dvhAbm5TaY7w1XaArj+HhnQwPj//jRaNBenvfwGJcjNNngL2W9D+5sQWDVuKcOTljt0mSxLnnnovf7+eVV16Z9PlXVVUBsGhGwVhIl66wQHXrYjxGt61pejWMjjtdavunzz/cZInH6GoGO2A/VzZZhVZ0R8CPrrHbsZ50EoPPP7/PVVfvP/5JtLubrJtvntajc5ieTpeUEnSAIbuT/BFlpLn3xCjAGTOy0GnEWNlld2u6YYB9yi79/e8SjY4Q7VTsd50Fu9/A4WiMpze1cfbsHGzG8dG0ubm5LFmyhPXr19PSMvECkb2prq6moKCAOaNt62q6hsdy0dUkO1fxoqsVo6uWdRF2r+JNNTQaEyGrHSkSgpGJHUiSRiK/wpHUhhdxbKtXEWlrI7BHWTE6OEjPX/6C9fTTMS84IennlGj0RUUInS71BF0IsUoIUS2E2CWE+M5+trlUCLFNCLFVCPGvibZJBnJmGUUBZaS59+IigAyjjqVlmby4bY+M9CXL6KrtxmScQVf3eEHv6n4FSTJR/243w5kSNdHd4vrmzi56h0NcNN/DRJx55plYrdZJTZAODAzQ2trKjBkzxkK6art96AoLiHR0EAuqVyfNK1SuLjrb1YnRVdu6mKpOl5hd+SCkb2LrIkB+pYO+9uTX0W3LlyN0Ogaf21126b3/fmIDA2TdfFNSz0UthFaLvrx8WjldDiroQggNcBewGpgNXCGEmL3XNl7g/wEny7I8B7g58ac6Ocx5lXiCTWhkeUKnCyhll7ruYWpGuwNVLB5tTddfzODgRwSCStSuLMt0d79MhnUx7TtrkWbkjJuAe2JDKw6zjtMr901WBDAajaxcuZLW1lY2bNhwwPOuHl1OPXPmzLGQrprOYfRxp8skR/mHQm7BaIxujzoxuqpaF1PY6SKcSrwB/fX73cbjPTJ1dI3N
huXUUxl8/gXkWIxIby+99z+AbdUqjLNmJfVc1GS6OV0mM0JfAuySZblWluUQ8BBw4V7bXAPcJctyH4Asy+omSh0Ad9EMTLSR44/ts7gozopZyoj0hdGyi83lJreikpYPlZp7V5cyOerzbSMYbCfUo9SA8+bPoyfQQ3+gn+FgZCxZUa/d/8s4d+5ciouLefnllxkeHt7vdlVVVWRmZpKVpXw4xEO6dAWjgq6i00Wr06LDxODQgGrHUM26mMJOF02mIoyxCWJ042QVWdEZNElfYARKJ6NIRwf+DRvo+dOfiQUCZN14Q9LPQ00MXi+R1jaiPvVcYIlkMoLuAfZUk+bR2/akEqgUQrwthHhXCLFqoh0JIa4VQqwXQqzvUmn1Y1GOmxH6KfDLNAwHJtwm127k+AL72KpRUBpIt2xux2goHqujd3W/AgiaPhgms6AIb/nxgBII9eK2dvzh6Dh3y0QIITj33HMJBAKsWbNmwm38fj/19fXjWs2VZVmo7RpGVxC3LqpbRzfprAwH1IzRVce6mMpOF2NGBSGdINqzfb/bSBqJvApH0hcYAVjPPBOh19P7wN/o+9e/sF9wAYaysqSfh5oYvF4AQtOk7JKoSVEt4AXOAK4A/iSEcOy9kSzL98qyvEiW5UXxkWiiybTo6RFh8v2xffJc9mTlnFw2NvXTMaiIvtKaDhjx0t//HuFwP93dL2O1zKV5Uy3eE0+iwqHMeu/q38XjG1rxOEwsnESyYnZ2NkuXLuXDDz+keYJFQrt27SIWi40T9PIsKwP+MP2mDITRqOoIHcBqySAU2/8VxOHizFXHupjaTpci/EYNcu+BxcQzWkcfGUzuPILGasF6+mkMvfgiciyG+/qvJvX4ySDudAlMk7LLZAS9BSjc4/eC0dv2pBl4SpblsCzLdcAOFIFPOkII+oxaPCMy3dEow/tpELFitlJ2iY/S463pOrYIZDlKS8uDDA1tQfaVIssxvEtOIteSi0lrYktHI2/t7OKiE/KRpMlZs8444wxsNtuEE6RVVVVYLBY8nt2j/XhIV133CPrCAkIqrxZ1OBzERISBPpVidFWyLkLqOl3i1sX9xejGyY/nox+BUXo8p8X5yUvQFxQk/fhqo/N4ECbTtKmjT0bQ1wFeIUSpEEIPXA48tdc2T6CMzhFCuFFKMEdsyDSUYcfjV0RzfxOj3mwrJZnm8WWXJcuof78FvS6buvrfA9C+OYo9O4esYuXSvsxexns7w8Rk9utumQiDwcA555xDW1sbH3zwwdjtkUiEnTt3MmPGDKQ9Mi/Ks6yAkhOjKyhUfYTuzlKsiy0N6sToqm1dTEWni16fTcCkR+Prhdj+O1dlFdmOWB3devbZZN18M+4bUqt2HkdIEoaKitQpuciyHAGuB14AtgOPyLK8VQhxmxDigtHNXgB6hBDbgFeBb8qyrE583ySIZZVQOKKMViayLoIykl8xO4e1Nd0MBZTJ0D1b08ViAYyGAurer6diyUljiyTKHeXUt2QxOy8Db87Umt3OmTOH0tJS1qxZg290kqWuro5QKDSu3ALgcZgwaCVqu3yji4vUjtFV4lo7VIrRVdW6mKJOFyEEUVs2IhaDwf2/bhqNRF6FnZYkO11ACbFyf/lLaJ0HLz1OVwwVFSlVckGW5WdlWa6UZblcluX/Gb3th7IsPzX6syzL8i2yLM+WZXmuLMsPqXnSB8Oc46U4oPwDTLS4KM7KObmEozKv71BELKfcizXTTc8OAwCa8Cxi0ei47HOHVElwJI9VczOnfF5CCD72sY8RCoV4+eWXAaXcotPpKC0tHbetJAlK3crEqL6gEHlkhGivOrZCAE+JIuhdnep8Dqudugip6XSRHaNljP3E6MbxVDrpaxtOeh39WMDg9RLt6ibSl/yS1lRJuZWiAJlFM8mMNmKJHHhidEGRk0yLfmzVqBK8tZSat9spKvgSXVucWBxO8r0zxh7T0pYHxJhTfGiBPVlZWSxbtoyNGzfS0NBAdXU1
Xq8XnU63z7blWVal5BKP0VWx7GJ32kZjdPtVO4Za1sUSewmSkFJS0IVLmZTbX4xunHgdPdl+9GOB6eR0SUlBL8rNIki/0jB6gtTFOBpJsHxWNq9WdRKKKDX3isVLiQRDRDqXUPv+DioWLxvLc5ZlmfW7BBpzLf3R+kM+v9NOO42MjAweeeQRfD4fM2bMmHC7siwLTX1+yFcEXW3rol5jZmhEPS+6WtZFg8aQst2LtJmzkYFYz4H7eGYV2dAakp+PfiwwnZwuKSnoLouePimExx+jcfjAS6JXzs5lKBjhvTql1BBvTffGP+8jEgyOK7d81DxAc18Ik3MrNQOHLh4Gg4FVq1YxPDyMEILKysoJtyvLshCNybSZlPpkWOWQLovBRiA0/ayLkMJOF2sZAYNEtKfqgNtpNBL55fakB3UdC2hzcpBstmnhdElJQRdC0KfXKCP0UPiAI8JTvG5MOs1Y2SXems7X24PRYqVg9nFj2z6xoQW9VqKyYPiwR4OzZs1i5syZzJ49G5PJNOE2cadL7WAEbVaW6iP0jAw7YfxEIvt3VBwOjlGnixoTo2X2spR0uphMRQSM0kFr6DCa65KuoyccIQQGr5fQzqN/wJCSgg4waM/A448RRKYztP+cb6NOw2mVbl7aK6wLoHzRiWhGm9tGojGe3tTK2bOymeE+/Mt7IQSXX345n/zkJ/e7zfiQLvWtiy6XC4RMR7M6IV3OUS+6GtbFCkdFSjpdTMYC/EYN0mD7Qbc9UvnoxwLxTBc1nWaJIGUFPeouwONXmkscyOkCStmlfTDA5halflxy/AIqFi9l/jnnjW3z1q5uun0hLpzvodxRTsdIB76QuvkONqOObJtBcbokYXFRdq4S29vapI4XXe3URUg9p4tGYyJsyUA7MgjhA0/EZxUrdfQjEaeb6hi8XqIDA6o2bE8EKSvo5rxKiv2KcOzPix7nrJnZaKTdGek6vYELv/F9cst3L3Z9cmMrGUYtZ8zIosyu5FXUDqi/dmrM6VJQSKS9nVhIvcvpsRjdNnVG6GpaF1PZ6RK15yo/9B/4Ck2jkcgrPzJ+9FRnujhdUlbQM4tmUhhsQsjyAa2LAE6LnsUlznGrRvdkJBThha3tnDsvH4NWM5bpkgzxiId0aQsKQJZVjdHNK3KPxuiqN8JTy7qYyk4XnMXK90nU0T2VDnpbh/EPpevoiWS6dC9KWUEvys1ByN1kBeSDllwAVszOpbpjiPrufV0eL23rYCQU5aL5Sg3YY/Wgl/RJEY94SJcvWzl2WMWyi06nRYuRgaF+1Y6hlnURUtfpsjtG9+DPLV1HVwdtZiYal+uoty6mrKA7zTr6RIACf4wG38Qxunuycq+wrj15YkML+XYji0tcAGgkDaX20sOyLk6WeEhXk1FZmRpSeWLUpLMy7FcvRldN62KqOl30zllEBUS7tx1026xiG1q9lLYvqsB0cLqkrKALIeg3SOT7D15yASh0mZmVl7GPoPf4gryxs5sL5nvGJSuWOcqSEtkaty42RLQIg0H1/qJWcwZBFWN0HSqGdKWs08VcTMCoIdZ78NGh5gjmo6c608HpkrKCDjCUYcPjj9ERixKIHrinJyiRuusbeunx7S7RPL2pjWhM5uN7NbKocFTQOtzKSHgk4ee9J/nxkK7uYXQFBaovLlJidMMMDagj6k6VY3Qh9ZwuSi66hDjIpGicfO9oHd2XWlcqRxqD10tsZIRIa+Lfu4kipQU97M6nYCSIDDQHD/7mXjk7h5gMa6p2d9B7YmMLM3NtzMgdn6xYblfEQ22ni2ZcSFeB6ouL3G51Y3RtmepZF1PV6aLXZykxukOTs8yl6+jqMDYxehQ7XVJa0C35Xgr9yj/BZMouc/Iz8DhMY/bFhp5hNjT2T9hmrsyhWBeT5nTpHh5bXKTmJV9OvtJJqr1ZHb+tpFHPupiqThclRteNJhQEf/9Bt89O19FVwVBx9DtdUlrQXYUzKQkoI9rJOF3iGelv7uxiJBTh
yY2tCAEXHJ+/z7aFtkJ0ki4pE6PlWVYae0cQBQXEhoeJqpiIWFCiTA53darjRQf1rIugTIymotNFdowOKiZhXdRoFT96eoFRYtHY7WhzctKCfqQoyssjI9yGITq5iVFQyi7BSIw3dnTzxIYWTix1ke/YN2tFK2kpsZckbYQejcm0u5R/alVjdDNtiJhW1Rhdta2Lqeh0EU6lxHewGN04+V4nPS3pOnqiUSZGj94BQ0oLutOiZ1D4ldTFSVgXARaXusgwarljzU5qu4cP2Gau3F6eNC86QKNRsU2qaV0UQmCQzAwND6p2DDWti6nqdNG4ZwMQ7dk+qe096Xx0VTB4vQRrapD306v4SJPSgg4woBdK6uJBYnTj6DQSy2flsK1tEL1GYvXcvP1uW+Yoo9WnvtMlHtLViHKloLZ10Wy04Q+r50VX07oYd7okw1KaTIyOmYS14qAxunGySzLQ6qQj0mc0lTF4vcjBoOpBeYdKygv6YIZVGaGHDxyjuyfxRUZnzczGbtq3k1Cccns5MrLqo8F4SFfdQAiN201IZetihtVOWPYTVSlGV03rYtzpkmp1dKOpEL9Rgt7JfVBptBK55fa0Hz3BHO1Ol5QX9HBWDp6RCD5kesOTE6jTZ2SxtMzF1aeWHnC7ZGa6xEO69AUFqo/QXS6nEqPbqk4P07h1Me10mTwmYyEBowYxMPnXzDNDqaOn89ETh6FcuQI8WidGU17QzXleCgLKKKUhMLmyi1mv5aFrl40t9d8fhRmFaIU2uSFdSchFz4rH6Daq40WPWxfVGKGD4nRJhvsomWg0RkIWG1pfH8QOvkgOoHCW8v5trlKvufixhmSxoCsoSAv6kSKzaNbuGN1JOl0mi07SUZxRnKRMFyWkayS/iHB7O7KaMboF2QB0tquX/aymdbHcUU7jYGPKOV2iGTlIsSj4JvdBm1Vkw2DW0rQ9LeiJ5Gh2uqS8oBfm55HnVyJnD5aLfigkL9NFmRhtdhVALEa4LfETinHyi7NBhp5u9YRATetiqjpdcBQp3yfhRQeQJEHBTCdN2/uO6vyR6YbB6yVYV6fqoOpQSXlBd5j1hPGRGYxRP3Tgji+HQoWjgqahJgKRydkiD5W4dbHZFLcuqldH1xt0Sozu4IBqx1C7YTSkntNFylSaicd6Jj86LJzlYrg/SF+7uk6sYwlDpRciEUINk/tgTSYpL+gAA3oZz4hMvW9yNfSpUOYoS4rTJd9hQq+VaBBmANVDuoxaKyOB6WldTFWnizZTaVge6d486cfE6+jpskviGIsAOAqdLseEoA/ZzOT7YzROclJ0KsRDutSeGNVIgjK3hbrhGEKnUz0X3WrOIBBVL0ZXTetiyjpdMioI6CVivTsm/ZgMtwl7likt6AlEX1YGknRUToweE4IeysrC44/RGosSjiW2llicUYxGaJLmdKnrHlFidFW2LjrsDmIixLAv8WUqUNe6CKnpdDEZiwgYJehrnNLjCme5aNnRTzQyOXdMmgMjGQzoi4vTgn6kMOdXUDAyREwIWicRozsV9Bo9RRnJGQ3GQ7ooLFS1FR1AZpZyqd5aPz2ti6nodNHr3UqM7mDnwTfeg8JZLiLBKB116s2JHGscrU6XY0LQnUWzKQ6MRuIm2LoIStlF7Vx02B3S1ZlfQUhlQY/H6LapFKMLo9ZFFWrooAh6qjldhBBErE60I0MQmfz72DPDgRDQtD29ajRRGLxeQo2NxALqmiGmyjEh6EX5HvJHrYuTXVw0FcocZTQOqT8aLHMrTpcWl4fY4CDRAfVGXAXF6sfoOnLz6W9Xz7oIqed0idnzEcgwMPk5FINZR05pBo3b0nX0RGGo9EIsRqj26Hp/HROCbjfrMIZ70cbUcbqU28uJyTHVR4NjDaPN6lsXHe4MhKyhr1+9UZ0jN49wMKCKdTFVnS7CNRqj2zc1y1zBLBddDYMEhsNqnNYxx9HqdJmUoAshVgkhqoUQu4QQ3znAdhcLIWQhxKLEnWJiGNbF
lIbRg4mf5EtWL8t4SFcj6lsXJUlCLywM+dSN0QV1rIsGjYFCW2HKOV00mTMBiPZsndLjCme5kGVoqU6XXRKBvrgYdLqjbmL0oIIuhNAAdwGrgdnAFUKI2RNsZwNuAt5L9EkmAp/NpKQujiS+5pXMXpZlWRbqAwJQNxcdwGyw4g+p50VX07oIo3n1KeZ00WfOJSYg0r1tSo/LKc1AZ9Sk7YsJQuh0GEpLCe6YZoIOLAF2ybJcK8tyCHgIuHCC7X4C/AI4umYJRgm53XhGYjREEn/JGR8NJmNitDzLSm2PH8npVN26mGGzE5L9xCYZBjVV1LYupqLTxWQpIWCQkHun9kGl0Uh4Kp1pQU8ghoqKaVly8QB7DgWbR28bQwixACiUZfmZA+1ICHGtEGK9EGJ9V5d67omJMBSU4fH7GRCCgXAk4fsvs5claYQ+GtJVXKH6alGn0wkiRlebOiKQDOtiVI7SMHj0LdE+VIzGAvxGDVL/1D/MC2e5GOwOMNCVjgFIBIZKL+GWFqI+9RbgTZXDnhQVQkjAb4CvH2xbWZbvlWV5kSzLi7Kysg730FPCVTiLIr/i2FAjpKvCUUHjYCPhqLqTTvGQrlZPhaqTogBZOUqMbkvj1HzPU0FN62Iy8+qThRKja0Xj65nyYwtnOYG0fTFRGLxeAEI1R88ofTKC3gIU7vF7wehtcWzAccBrQoh6YCnw1NE2MVroKcATaAfU8aKXOcqIyBHVR4PxkK4WVwHh1lbkSOKvNuLkFSgfup1t6l1NqWldTFWnS9SWhTYYgKBvSo9z5JixOg3pskuCOBqdLpMR9HWAVwhRKoTQA5cDT8XvlGV5QJZltyzLJbIslwDvAhfIsrxelTM+ROxmPU6/IkyNIypmuqg8CRcP6Wo2uyAaJdzert6xkhKjO2pdVMEemcy5jaTiGB1fTTJGN44QgsLZLlqq+4hF0zEAh4uuoABhNB5VE6MHFXRZliPA9cALwHbgEVmWtwohbhNCXKD2CSaSmCaIPSRTP5B462KJvQSBUH0hSzykqyFuXVTR6WI0GdBgZGCgX7VjjFkX1Wp2YS9PuRG6cCmX+tEpxOjGKZzlIjgSobNBPffSsYLQaDCUlx9V1sVJ1dBlWX5WluVKWZbLZVn+n9HbfijL8lMTbHvG0TY6jzNsMeLxx6hXIXDKpDVRYCtIiniUZVloCCp/OrWti0atBZ9fPS+6I1dl62IKOl20WXMAiHR9NOXHFsx0gkjH6SaKo83pckysFI0Tysok3x+jIZj4kgskMdPFbaVpMERYZ1Ddumg12QiqGKOb4VbfuphqTheDczYRjSDWUz3lx5qserIKbWlBTxCGSi+Rzk6i/f1H+lSAY0zQ9QUleEbCtCATVWESrsxRRv1gPeGYyk6XbCWkq6t0JiGVrYt2u4OoCOFXYUEWqG9dTEWni8lcgt8oQX/9IT2+cJaLjtpBQgH1JtSPFeJOl6NllH5MCbqzaCaF/l4iQtAeTLzoVjgqiMQiNA2pK7LxkK52j1f1EXqmOxOAFpVidEFd62IqOl30ukwCRh3SwKFNiBfOchKLybTs6E/siR2DHG1Ol2NK0AsLiigIKJ5qtayLoP5oMB7S1ZLpUXVSFCB3LEZXPS+6mtbFVHS6jMXo+vrhEF6zvHIHWp2ULrskAG1eHpLFctQ4XY4pQc8w6ckMKIuLGvyJr6OXZpQCyQvpajJlEh0YIDqo3qRlfjxGt2PqC1kmi5rWRUhNp0vMnosmGoHhqccba3QS+V4HzWlBP2yEEBi83qPG6XJMCTqAOTqEJMvUDyR++bNZZ8Zj9SQlg7ssy0KjUEbqanYvysy2I2SJvj71Vheqbl1MQacLzhIA5L76Q3p44WwXfe0j+PqOyuilaYXBW0Fw505VrjCnyjEn6EGLnly/TL0KMbqgiEcyEv7Ks6zUBwQy6uaiS5KETlgYGk6CdbEj7XSZLBrXDAAi3VsO6fGFs5RM/XTZ5fAxeL1E+/uJ9qh3FTtZ
jjlBD2c68fhjNAyrE1BUbi+nbqCOSExdB0FZlpWBUIwBvUX1kC6z3oo/qN5ClDHrokoj9FR0umizjgcgeoiC7sq3YM7Q05TuYnTYjDldjoKyyzEn6NqCYjz+KI2xqCr7L3OUEY6FaR5S130Snxhtyy1VfXGRzZpBSB5RLUZX0miwZ+eoZl1MRaeLyT6DkE4Q6z00ERFCUDDLSVNVH3LsyJcKpjNjTpejoGn0MSforuKZFPgH6dVoGI4mXtSTlelSMRrS1VqgvnXR6XQhixg9Hf3qHSMvXzXrYio6XeIxuvQf+od54SwXAV+Y7uaphXylGY/G7UbjcKRH6EeCgsJiCuIxuipaF9WeGI2HdLW4PKovLsrOGfWiN6rnRVfTugip53TRaAyEzBa0g4eehJmuoyeGMafLUeBFP+YE3WbS4w4okxdqWBctOgt5ljzVxUMjCUozLTSb3YRbWpFVuNqIk1uQDUBHq5oxuipbF0edLmrn1SeTiC0T3YgPDrF8aLEbcOVb0oKeAI4Wp8sxJ+gAjrDi2GgYVMeyVeYoS047umwLDcIMkQgRFWN0PSWKF13NGN1kWBejcpT6wXpV9n8kkO0FCFmGwZaDb7wfCme5aNs1QCSk3oDgWMDg9RLz+VT9P5wMx6Sga/QylohMfb86oVNxp0tUpYnXOGVuKy0hibDQqGpdNJkNaGSDqjG6alsXU9HpIlxKee9QYnTjFM5yEY3EaN3Vn6CzOjY5WjJdjklBj2Ta8YzEqBtUx4pX4aggGA3S6lNHnOKUZ1uIytBucaluXTRqLPj809e6GHe6JGONQLLQuuMxuhsOeR/5XgeSVqTb0h0mY06XIxwBcEwKuqawUInRDanjFY9PjKpdR4+HdDXZc1XvL2ox2QhE1HNDxK2LajtdUmmErs+ajwxEu7cf8j50Bg155fZ0Hf0w0TgcaLOyjrjT5ZgUdFfJTAr8I7RqJVUmMcrsoyFdKo8Gx7zoeeWqh3TZMxxECRJUwRkUx5mXr5oXHZS/Syo5XYzWMgIGCbmv7rD2UzjLRU+zj5HBFIpGOAIYvEe+2cUxKegFRWUU+HsJShJdKozSbXob2eZs1a2L8ZCulswCQirmuQC43C4Q0NKgonUxJ09V62KFoyKlnC56XSYBkw5p4PA+BNP2xcQQty7KKi3AmwzHpKBbjTqyA8qbtyGgzqikwlGRlHptWZaFJnOm6iP0nLwkxOjm5atuXUwlp4sQgrDFjnbo8ITYXWjDaNGl0xcPE4PXixwIqBqWdzCOSUEHcIUGAKgfUiekq8xeRt1AHTFZ3U/rsiwrjcJCpK+PqE+9GrcnCTG6ybAuQmo5XaL2HHTBAIQP/X0sSYKCmU6atvcecR/1dOZoaHZxzAq6TVIuu+t6VbIuOsrxR/zqO12yrAzGpNGQLvVGBlm5TpAlenvVG8WpbV0stZemnNMFRxEAct/hJUkWznIxPBCit029/rGpjv4ocLocs4IuXFayAzFq+wZU2X98NKj2AqP4xGizNUvVkC5JI6EXZoZ86sXoqm1dTEWni5RZCUCka/Nh7adglhOA5rR98ZDRWK3o8vOPqNPlmBV0qaAAz0iMBr9Kq0XtyWlHFw/parFlqx7SZdJbGVExRldt6yKkntNF654LQLh702HtJyPThCPHnJ4YPUz0R9jpcswKuqN0BgX+IC1CnZfAbrCTZcpSXTziIV3NrgLVFxfZLOrG6IL61sVUc7oYM+cTlSDWs+Ow91U400nLjj6i4SPn0pjuGL1eQrW1yOEj8/46ZgW9oLiMAv8AXToNgag6b+AyR5nq1sV4SFdrZoHqi4ucTieyiNLXrU6ZCtS3Lqaa08VoKiBg1CD6D78bU8EsF5FQjPZa9f6+qY7B60UOhwk1Nh6R4x+zgm416skO9CMLQXNQHetiuV1pR6e2c6A820JjEqyLWdmjMbr109u6COov+koWGo2BoNmENHj4fxPPDCdCEumyy2GgP8LNLo5ZQQfIjFsXferU
0eNOl/ZhdRPYytxW2iQzI63t6sboetSP0XXm5AHqWRfHnC4pNDEasWai8x3+qNpg0pJbmpEW9MPAUF4OQhyxidFjWtCdKEJer6J1EZKQ6ZJlIYqgTW8j0qne6LmgNBeA7m71vOiOPA+gnnUxFZ0uMXse2kgE/Id/VVMwy0Vn4xCB4dSYY0g2ktGIvqgoLehHAqvNgCEqU9Olzogk3o5Obeti+ajTpcmWraoX3Ww1Isl6VWN01bYuguJ0SSVBF85SAKKJmBid5QIZmqvS9sVD5Ug6XY5pQdd48vH4Y9T5RlTZv8PowGV0qS4eu73o2apPjBo1Fnwj09u6mGpOF81ojG6489BjdOPklNjQGzXpssthYPB6CTU0EAsmviPawZiUoAshVgkhqoUQu4QQ35ng/luEENuEEJuEEGuEEMWJP9XE46jwkj8SoVnFOctkZLrEQ7qabdmqWxctRnVjdEF962K5o5yIHEkZp4suez4A0e6th70vSSPhmZGOATgcjF4vRKOE6g4vBfNQOKigCyE0wF3AamA2cIUQYvZem20AFsmyPA/4N/DLRJ+oGniKvRT4h2jVaVV785bZFeui2v8cZVkWpWG0yiN0e4aDCAFCIfVGt8mwLkLqOF1MjlmEtQK5NzHPp3CWi6GeAANd6uQcpTpH0ukymRH6EmCXLMu1siyHgIeAC/fcQJblV2VZjtct3gUKEnua6mAx6skLDDCi1dAbVscdUu4oxxf20TGiXuwsKCFdzUmwLroylRjdtobpa10syShJKaeLTpdJwKhFDBx6b9E9icfpptMXDw1DSQlotUdkYnQygu4B9lSJ5tHb9sfVwHMT3SGEuFYIsV4Isb6rSz3r21RwB5VsksYRdepdY5kuKi8wKs+yMigZ6GpXz4ECkJPvBqC1ST1BH7MuqlR2MWqNFFgLUkbQlRjdDDRDifnb27NN2FxGGrelBf1QEHo9htKSo1bQJ40Q4tPAIuBXE90vy/K9siwvkmV5UVZWViIPfci4UC4r63rVqQsn6/I+PjFaH9IQG1Fnkhcgv0iJ0e3s6FbtGGPWRZXr6Kki6ACRjCz0wz5IQCyDEILCWU5aqvuIqbSKOtXRVxwZp8tkBL0FKNzj94LR28YhhDgb+B5wgSzLyZ/ePURcRg0AO9vVESiX0YXT4FRdPMrdu0O61OxelJOfCbKgr1c9W5tiXdSoal1MNacLjiIkWUYeSow7qHB2JqFAlM4G9RxNqYzB6yXc1KTq4GoiJiPo6wCvEKJUCKEHLgee2nMDIcQJwB9RxFy9a3EVsORl4wrGqOlX741b5lDf9+xxmtBL0GRV14suaSR0mBkcUi/vQ7Eu5qqbuugoSymni3ApE3Hhro8Ssr+CGU4Q6bZ0h4rB6wUgWJPcq8CDCrosyxHgeuAFYDvwiCzLW4UQtwkhLhjd7FeAFXhUCLFRCPHUfnZ31GH3evH4YzSreGmZjEwXjSQocZlptmapPjGqdowuJCd1EVLH6aLNGvWiJ0jQjVYd2UW2tKAfIoYj5HTRTmYjWZafBZ7d67Yf7vHz2Qk+r6ThKfVSsOE9NjqNqh2j3FHOUGiIbn83WWb15g7KcjPY0pCjunXRZsmgta8bWZYRQqhyDEdOHk1bN6t2jFRzuuizFgIQ665K2D4LZrnY8GIjIX8EvWlSUpFmFH1REUKvT/rE6DG9UhTAbDSQ5x+kU68jHFPX96x2pktFlpVWs4uRJMXo9veoN0pX27qYak4Xk62MgF6CBMToxima5UKOybTsSMcATBWh0aCvKE8L+pEgKzhEVBK0BFS2LiahHV1MSDR0qVsOcWeNxug2qJciqbZ1EVLL6SJJBoJmI9Jg4uYdcsvsaPUSTem2dIeE4Qg4XdKCDmTKo9bFPnVmpDONmWToM9R3uoyGdNUPRZFV7Cq0O0Z3elsXU83pErE60Q71J2x/Gp1EvteZrqMfIgavl0h7O9FB9frw7k1a0AG3XqnR7mhRZ8QphFAyXZIU0tVo
dBLpUk9sPSWKoHd3qbeIKRnWxbjTpWEwcWWKI0nMnos+EIBI4hq2FM5y0t8xwlCvOj0DUpkxp0sSR+lpQQeys+xoYzI1KrZWK3OUqe50sRl1uA1CcbqoGNJls1uQZB39KtW3ITnWxbjTZddAijSNdpQggGhv4uq2hbOVGID0KH3qGCpGBT2JTpe0oAMObwX5fpmmUES1Y5TbyxkIDtATUHdpflmmiRZrFiGVrYsGycLQiLqXko7cPFVLLqnmdJEyZwIQ6vwwYft05Vmw2PVpQT8EdPl5SGZzUidG04IO5JdXku8P0KbRqHaMpGW65DuVRhcqO10sRhuBsDqdnuI4c/NVTV1MNafL7hjdLQnbpxCCglkumrf3IavkAktVhCQpzS7Sgp5czEYjHr+PVoNetWMkK9OlPCeDIb2Frmb1ShUAGRl2IvgJqxmjq7J1EVLL6WJ0zycmEltyASV9MTAcpqspHQMwVZLtdEkL+ijZQR+DOi2DEXVidLNMWdh0NvWdLtmK02VXp7qj57EY3Sb1Jl+TZV1MFaeLzpClxOj2J7bcVjDTCaTr6IeCwesl2tNDpEfdUmuctKCPkhVVLIv1A+pYF4UQScl0iYd01Q+qNx8AkJ2rrHhtbVIv5z1ZqYup4nRRYnStaAcTG01tsRvI9FjTfvRDYMzpkqSJ0bSgj5KlU+qD1Q2JaRIwERWOCtUXF3mcJvTINGIk5lev40x+kWJd7FIppRL2sC6mnS6TJmJzoxtOfGmkcJaTtpp+wiF1rmBTlTGnS5LKLmlBHyXPZQOgukO9S6Myexm9gV56A+pdumokQZF5NHWxRb0Pp9wCN8iCXhVjdMesiyp60VPN6SLbPejCEeRAYh1IhbNcxCIybTv7E7rfVEebnYVktydtYjQt6KPklZeQEZZp8qtXSx2bGFV7gZHLRLPK1kWtVoMOk6oxuqC+dTHVnC7CpbzHwl0bErrfPK8DjVaiMV1HnxJCCAxJdLqkBX2U/MpZeEbCtInpb12sKMik3ZLJSKO61kWT3spwQOUYXZWti5BaTheNezRGtyOxgq7Ta8irsKf7jB4CcaeL2o3iIS3oY5iMBjz+Edp16lkXc8w5WHQW9a2LhW6ikoa6RnV7jVjNGYRi6rppHLl5SbEuporTRZ+9AIBoz/aE77twlouelmGGB6ZNQ7KjAoPXS2xwkEin+r1/0oK+BzlBH20mPVGVPkmFEJTby9VfXDRqXaztVKdPahyHw0FMRBjoVW+U7szNB9S3LqaK08XonE1EI5D76hK+78JZSgxAc1Xa7TIVxpwuO9Qvu6QFfQ+yIn7CkqBtWL0gojJHmeq56PGQrjqfutbFrLEY3eltXUwlp4ukMRI0GZAGEv96uQusGK06mralyy5TIZkhXWlB34MsjRI5u61ePXdIub2cnkAP/YF+1Y6RYdSRKcI0hHWq1u1yRmN021sS63vek2RYF1PN6RK22tH6Ej+KFpKgcKaTpqrepNSDUwWt04nG7U7KxGha0PcgP8MMQFWzeo0bktXsotQMzSYX0W4VY3RLcwB1Y3STYV1MNadLNCMH/fAwqCC6BbNcjAyE6G1Vd+4k1UiW0yUt6HtQXlKAJMs0+dQruSQr06XUZaLJlq1qf1G7w4oka+nv71ftGDBqXexQN5smlZwuOIrQxGSig4m3rcbr6OkYgKlhqPASrKlRtfEMpAV9HIWzZ5Drj9Emq/ey5FpyMWlNqotHhcfFkN5CZ526Mbp6ycLQsLoxus7cfPrbWlW3LqaK00XKrAQg2LEu4fu2uYw4cszpGIApYvBWII+MEG5V70oT0oI+DpPJTH4gQLvOoNoxJCFRZlc/08VbobhDdjWoa5VSYnRVdtMkybqYKk4XrXseAJHuzarsv3C2i9YdfUTD6o42U4mxCACVnS5pQd+LXP8wrUb1BB0U8VB9cVGekpBXo7J1McNmJ4yfiEoplZAk66JdKYWlgtPFkLMIgFjPDlX2XzjLRSQco61W3VXC
qYTBqzip1Ha6pAV9L3IifnoNWobV7F7kKKfT38lgSL1ShcdpQidHqRtS17rocrlAyLQ3qed0cYwKuprWxVJ7KZKQVP+gTQY6SwEhnQT96lxteCodSJJI19GngMZmQ5uXp/rEaFrQ9yJbKAK4TcXac3w0qKZ4aCRBoQjQENaqdgyA7Fw3AK1N6pV2MrKyVbcuxp0uaq8RSAZCCEIWC5pBdf4meqOWnLKMdAzAFDFUqO90SQv6XuRZjYC6XvQyRxmgfkhXiQkadXZiQfWWasdjdDtVjNFNhnURSEpefbII21zofOqVRApnuehsHCLgm/6TyMnC4PUSqq1Fjqh31ZwW9L2oKMwFoHFAPZ+tx+rBqDGqbl0sc5mUkC4VrYt5hVkgQ2+PuqO1ZFgXKxwVKeN0kTPy0fuDyNGQKvsvnOUCGZqq0qP0yWLwepFDIUKN6l39pwV9L2bOnoE5ItOmYo6/JCRK7aXqT4x6XEQlDbXVjaodQ6vTosXE4KC6E2TJsi6mitMFZykSEFLJ6ZJdbENv0qbLLlPAUDE6Mapi2SUt6Hthtlrx+EO0a9V3uqhdr02WddGkUz9GN25dHBnoV+0YqeR00bhnAxDu/FCV/UsaiYIZTpq296VjACaJobwMhCC4Ky3oSSXPP0K70ajqMcod5XSMdOALqWcrrPAqwVa1neqKrdWcQVDlGN24dbGvTb25jVRyuuizTwAg0r1FtWMUznYx1BtgoFO9VoephGQ2oyssVLW/aFrQJyAnHKDFZCCm4jLdMaeLipkudpMeV3iY2iF1+0AqMbphhlScd0iGdTGVnC6G7AXEALlXvfdX4SxlrUPavjh51Ha6TErQhRCrhBDVQohdQojvTHC/QQjx8Oj97wkhShJ+pkkkWw4T1AhqVGwYnax2dEVJsC664zG69erF6CbDugip43SRtGZCRh1iQL33sD3LTIbbmBb0KWDwegk1NBALqTNZfVBBF0JogLuA1cBs4AohxOy9Nrsa6JNluQL4LfCLRJ9oMvGYdQBs2VWv3jGsHvSSXnXxKDXJNGlt6sbo5mcB6sboKtbFHNWti6nkdAlZM9AOqWcnBSV9saW6j1g0HQMwGQxeL0QihOrqVdn/ZEboS4BdsizXyrIcAh4CLtxrmwuBB0Z//jewXAghEneayaV8dLFMQ696Kzk1koZSe2lSUhcH9Ra6WtSbGC0oUT9GF5SySzJSF1PF6RK1ZaEfVjf6oWiWi1AgSke9uvM0qcJYBIBKZZfJCLoH2NM42Tx624TbyLIcAQaAzL13JIS4VgixXgixvqtLvdHc4XL8cZWc0tWP06Bef1GAFcUrmJ2598VOYlkwu5Dl0Q4CfvUWFzkyM3AZPThdDtWOAVBy/EI8M9R9vWa7ZrOieIWqx0gWovQ0RvJKkWPqXW14ZjgpmpPJ9B2+JRd9aSnWM85AY7ersn9xsEtxIcQlwCpZlr84+vtVwImyLF+/xzZbRrdpHv29ZnSb/V7vLVq0SF6/fn0CnkKaNGnSHDsIIT6QZXnRRPdNZoTeAhTu8XvB6G0TbiOE0AJ2QN3r7zRp0qRJM47JCPo6wCuEKBVC6IHLgaf22uYp4LOjP18CvCKnVxukSZMmTVI5qJ9NluWIEOJ64AVAA/xVluWtQojbgPWyLD8F/AX4uxBiF9CLIvpp0qRJkyaJTMqgLMvys8Cze932wz1+DgCfTOyppUmTJk2aqZBeKZomTZo0KUJa0NOkSZMmRUgLepo0adKkCGlBT5MmTZoU4aALi1Q7sBBdwKGur3YD6oZUJI/0czn6SJXnAenncrRyOM+lWJblrInu+P/t3U+IVWUYx/HvjwxKi7SNlAa2CCOkMlxoggv/gJSY+4qillEWQSjtQzCkIChC/AMOgoxGEBgOJrSpoCwsnciFYVOj40aNXFT4c3HegWGmodM5x977Hp4PDPfOXdzze5h7nznnvfecJ1tDb0PS17OdKVWaqGXw9KUOiFoG1c2qJZZcQgihJ6Kh
hxBCT5Ta0D/MHaBDUcvg6UsdELUMqptSS5Fr6CGEEGYqdQ89hBDCNNHQQwihJ4pr6P82sLoUku6TdELSGUmnJW3NnakNSbdI+lbSJ7mztCFpvqRhST9KGpW0KnempiS9ll5bP0g6KOm23JnqkrRH0kQanjP52N2SRiSdTbcLcmasY5Y6dqbX1ylJH0ma39X2imroNQdWl+Jv4HXbDwErgZcKrgVgKzCaO0QH3gU+tf0g8AiF1iRpEfAKsML2MqpLX5d0Wet9wMZpj20Djtt+ADiefh90+5hZxwiwzPbDwE/A9q42VlRDp97A6iLYHrd9Mt3/napxTJ/VWgRJi4Engd25s7Qh6S5gDdX1/bH9p+3LWUO1Mwe4PU0Rmwv8ljlPbbY/p5qtMNXUYfT7gS3/Z6Ym/qkO28fS7GWAL6mmwHWitIZeZ2B1cSQtAZYDX2WO0tQ7wBvA9cw52rofuATsTctHuyXNyx2qCdu/Am8D54Fx4IrtY3lTtbbQ9ni6fwFYmDNMR14Ajnb1ZKU19N6RdAdwGHjV9tXcef4rSZuACdvf5M7SgTnAY8D7tpcDf1DGYf0MaX35Kap/UvcC8yQ9kzdVd9KIy6K/cy3pTaql16GunrO0hl5nYHUxJN1K1cyHbB/Jnaeh1cBmST9TLYGtlXQgb6TGxoAx25NHSsNUDb5E64Fzti/Z/gs4AjyeOVNbFyXdA5BuJzLnaUzS88Am4Oku5y+X1tDrDKwugiRRrdWO2t6VO09TtrfbXmx7CdXf4zPbRe4J2r4A/CJpaXpoHXAmY6Q2zgMrJc1Nr7V1FPoB7xRTh9E/B3ycMUtjkjZSLVFutn2ty+cuqqGnDxImB1aPAodsn86bqrHVwLNUe7TfpZ8ncocKvAwMSToFPAq8lTdOM+koYxg4CXxP9V4v5tR5SQeBL4ClksYkvQjsADZIOkt1BLIjZ8Y6ZqnjPeBOYCS97z/obHtx6n8IIfRDUXvoIYQQZhcNPYQQeiIaeggh9EQ09BBC6Ilo6CGE0BPR0EMIoSeioYcQQk/cAJAmOppr2MjWAAAAAElFTkSuQmCC\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Keras Code",
+ "language": "python",
+ "name": "dswipython"
+ },
+ "language_info": {
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130008",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "278px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- ],
- "source": [
- "for _, user_df in sub_user_info.groupby('user_id'):\n",
- " item_sim_list = get_item_sim_list(user_df)\n",
- " plt.plot(item_sim_list)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
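-    "Because the word vectors were trained for only a few iterations, the visualization here is not very accurate; training for more iterations should show the pattern more clearly."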
- ]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
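-    "## Summary\n",
-    "\n",
-    "The data analysis gives us the following key observations, which will be very helpful for the feature construction and analysis that follow:\n",
-    "1. The user ids in the training and test sets do not overlap, i.e. the model has never seen the test-set users\n",
-    "2. The minimum number of clicked articles per user is 2 in the training set and 1 in the test set\n",
-    "3. Users do click the same article repeatedly, but this only happens in the training set\n",
-    "4. The click environment of a single user is not always unique, so statistical features can be used for this part later\n",
-    "5. The number of articles clicked varies widely across users, so it can later be used to build a user-activity feature\n",
-    "6. The number of clicks an article receives also varies widely, so it can later be used to build an article-popularity feature\n",
-    "7. The news a user reads is strongly correlated, so whether a user is interested in an article depends to a large extent on the articles they clicked before\n",
-    "8. The word counts of the articles a user clicks differ considerably, which can reflect the user's preference for article length\n",
-    "9. The topics of the articles a user clicks also differ considerably, which can reflect the user's topic preference\n",
-    "10. The time gaps between clicks also differ across users, which can reflect the user's preference for article timeliness\n",
-    "\n",
-    "The analysis above therefore helps us with the feature engineering that follows and with fully mining the information hidden in the data."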
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
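-    "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its members share an open-source and exploratory spirit. With the vision of “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, a willingness to experiment, and a sense of responsibility. Datawhale also applies the open-source philosophy to open-source content, open-source learning, and open-source solutions, empowering talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data-mining learning path will be shared on Tianchi; for details, follow Datawhale:\n",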
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Keras Code",
- "language": "python",
- "name": "dswipython"
- },
- "language_info": {
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130008",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "278px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
- }
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/ch03/ch3.1/jupyter/\347\211\271\345\276\201\345\267\245\347\250\213.ipynb" "b/docs/ch03/ch3.1/jupyter/\347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
index f4e21cabc..d74eed156 100644
--- "a/docs/ch03/ch3.1/jupyter/\347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
+++ "b/docs/ch03/ch3.1/jupyter/\347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
@@ -1,1772 +1,1772 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
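-    "# Building features and labels: casting the task as supervised learning\n",
-    "Let us first go over which features in the raw data can be used directly:\n",
-    "1. Features of the article itself: category_id is the article's category; created_at_ts is the article's creation time, which relates to its timeliness; words_count is the article's word count. Readers generally tend not to click very long articles, although some people do prefer long reads.\n",
-    "2. The article content embedding. It was used during recall; here it is optional, and other kinds of embeddings, such as W2V, can also be tried\n",
-    "3. The user's device features\n",
-    "\n",
-    "Once feature engineering is done, the directly usable features above can simply be joined in by article_id or user_id. First, however, we need to construct some features from the recall results and then create labels, forming a supervised learning dataset. \n",
-    "The idea for constructing the supervised dataset: the recall results give us a dictionary of the form {user_id: [articles the user may click]}. For each user and each candidate article we then build one row of the supervised dataset. For example, if user1's recall list is {user1: [item1, item2, item3]}, we get three rows (user1, item1), (user1, item2), (user1, item3), which are the first two columns of the supervised dataset. \n",
-    "\n",
-    "The idea for constructing features is as follows. The article a user clicks is strongly related to the articles they clicked in the past, e.g. the same topic or similar content. So an important family of features **must draw on the user's historical clicks**. We already have a two-column dataset of users and candidate articles, and our goal is to predict the last clicked article. A natural idea is to relate each candidate to the user's last few clicks: this takes the click history into account while staying close in time to the final click, because timeliness is a defining characteristic of news. A user's last click is often closely related to the few clicks before it. For each candidate article we can therefore build the following features related to the last few clicks:\n",
-    "1. Similarity between the candidate item and each of the last few clicks (embedding inner product) --- tied directly to the user's historical behavior\n",
-    "2. Statistics of those similarities --- statistical features can dampen fluctuations and outliers\n",
-    "3. Word-count difference between the candidate item and the last few clicked articles --- word count can reveal user preference\n",
-    "4. Creation-time difference between the candidate item and the last few clicked articles --- the time difference can reveal the user's preference for article timeliness \n",
-    "\n",
-    "\n",
-    "One more thing to consider:\n",
-    "**5. If YouTube recall was used, we can also build a user-to-candidate-item similarity feature**\n",
-    "\n",
-    "\n",
-    "\n",
-    "Of course, the above is just one way to do feature engineering from user history; feel free to brainstorm and try other features. Below we implement these features, in the following order:\n",
-    "1. First obtain each user's last click and historical clicks, based on the log data\n",
-    "2. Build features from the user's historical behavior, using the user's click-history table, the final recall list, the article info table, and the embedding vectors\n",
-    "3. Create labels to form the final supervised learning dataset"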
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "# Imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:00.341709Z",
- "start_time": "2020-11-17T09:06:58.723900Z"
- },
- "cell_style": "center",
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import pandas as pd\n",
- "import pickle\n",
- "from tqdm import tqdm\n",
- "import gc, os\n",
- "import logging\n",
- "import time\n",
- "import lightgbm as lgb\n",
- "from gensim.models import Word2Vec\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "# Memory-saving function for DataFrames"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:02.411005Z",
- "start_time": "2020-11-17T09:07:02.397830Z"
- }
- },
- "outputs": [],
- "source": [
-    "# A helper that reduces memory usage\n",
-    "# by downcasting numeric columns to the smallest dtype that fits\n",
- "def reduce_mem(df):\n",
- " starttime = time.time()\n",
- " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
- " start_mem = df.memory_usage().sum() / 1024**2\n",
- " for col in df.columns:\n",
- " col_type = df[col].dtypes\n",
- " if col_type in numerics:\n",
- " c_min = df[col].min()\n",
- " c_max = df[col].max()\n",
- " if pd.isnull(c_min) or pd.isnull(c_max):\n",
- " continue\n",
- " if str(col_type)[:3] == 'int':\n",
- " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
- " df[col] = df[col].astype(np.int8)\n",
- " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
- " df[col] = df[col].astype(np.int16)\n",
- " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
- " df[col] = df[col].astype(np.int32)\n",
- " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
- " df[col] = df[col].astype(np.int64)\n",
- " else:\n",
- " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
- " df[col] = df[col].astype(np.float16)\n",
- " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
- " df[col] = df[col].astype(np.float32)\n",
- " else:\n",
- " df[col] = df[col].astype(np.float64)\n",
- " end_mem = df.memory_usage().sum() / 1024**2\n",
- " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
- " 100*(start_mem-end_mem)/start_mem,\n",
- " (time.time()-starttime)/60))\n",
- " return df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:05.031436Z",
- "start_time": "2020-11-17T09:07:05.026822Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
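-    "# Reading the data\n",
-    "\n",
-    "## Splitting into training and validation sets\n",
-    "\n",
-    "We split off a validation set so that model parameters can be evaluated offline. To mimic the test set as closely as possible, we take all the records of a sample of training-set users as the validation set. Splitting in advance also spreads out the cost of building ranking features, since building them for the whole dataset in one go can take a long time."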
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:07.230308Z",
- "start_time": "2020-11-17T09:07:07.221081Z"
- }
- },
- "outputs": [],
- "source": [
-    "# all_click_df is the training set\n",
-    "# sample_user_nums: number of users sampled to form the validation set\n",
- "def trn_val_split(all_click_df, sample_user_nums):\n",
- " all_click = all_click_df\n",
- " all_user_ids = all_click.user_id.unique()\n",
- " \n",
-    "    # replace=True samples with replacement; replace=False (used here) samples without\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False) \n",
- " \n",
- " click_val = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
-    "    # Extract each validation user's last click as the answer\n",
- " click_val = click_val.sort_values(['user_id', 'click_timestamp'])\n",
- " val_ans = click_val.groupby('user_id').tail(1)\n",
- " \n",
- " click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)\n",
- " \n",
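-    "    # Drop users in val_ans that have only one click: if that single click goes into the answers,\n",
-    "    # the user has no click history left, which creates a user cold-start problem and complicates validation\n",
-    "    val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())] # keep only answer users who still appear in the validation set\n",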
- " click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]\n",
- " \n",
- " return click_trn, click_val, val_ans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Getting historical clicks and the last click"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:19.202550Z",
- "start_time": "2020-11-17T09:07:19.195766Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Get the historical clicks and the last click of the given data\n",
- "def get_hist_and_last_click(all_click):\n",
- " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
- " click_last_df = all_click.groupby('user_id').tail(1)\n",
- "\n",
-    "    # If a user has only one click, hist would be empty and the user would be invisible during training, so we allow this one leak by default\n",
- " def hist_func(user_df):\n",
- " if len(user_df) == 1:\n",
- " return user_df\n",
- " else:\n",
- " return user_df[:-1]\n",
- "\n",
- " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
- "\n",
- " return click_hist_df, click_last_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Reading the training, validation, and test sets"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:21.181211Z",
- "start_time": "2020-11-17T09:07:21.171338Z"
- }
- },
- "outputs": [],
- "source": [
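-    "def get_trn_val_tst_data(data_path, offline=True, sample_user_nums=10000):  # sample_user_nums: number of validation users (10000 is an assumed default)\n",
-    "    if offline:\n",
-    "        click_trn_data = pd.read_csv(data_path+'train_click_log.csv')  # training-set user click log\n",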
- " click_trn_data = reduce_mem(click_trn_data)\n",
- " click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)\n",
- " else:\n",
- " click_trn = pd.read_csv(data_path+'train_click_log.csv')\n",
- " click_trn = reduce_mem(click_trn)\n",
- " click_val = None\n",
- " val_ans = None\n",
- " \n",
- " click_tst = pd.read_csv(data_path+'testA_click_log.csv')\n",
- " \n",
- " return click_trn, click_val, click_tst, val_ans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Reading the recall lists"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:23.210604Z",
- "start_time": "2020-11-17T09:07:23.203652Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Return the multi-channel recall list or a single-channel recall list\n",
- "def get_recall_list(save_path, single_recall_model=None, multi_recall=False):\n",
- " if multi_recall:\n",
- " return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))\n",
- " \n",
- " if single_recall_model == 'i2i_itemcf':\n",
- " return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'i2i_emb_itemcf':\n",
- " return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'user_cf':\n",
- " return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'youtubednn':\n",
- " return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Reading the various embeddings"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
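-    "##### Word2Vec training and using gensim\n",
-    "\n",
-    "The main idea of Word2Vec is that a word's context expresses its meaning well, so word vectors are learned in an unsupervised way. Word2Vec has two classic models: skip-gram and CBOW.\n",
-    "\n",
-    "- skip-gram: predict the surrounding words from the center word.\n",
-    "- cbow: predict the center word from the surrounding words.\n",
-    "![image-20201106225233086](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)\n",
-    "\n",
-    "Several parameters matter when training word2vec with gensim:\n",
-    "- size: the dimensionality of the word vectors.\n",
-    "- window: how far away a context word can be and still be related to the target word.\n",
-    "- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.\n",
-    "- workers: the number of threads used for training\n",
-    "- min_count: the minimum word frequency; words that occur fewer times are ignored\n",
-    "- iter: the number of passes over the whole dataset during training\n",
-    "\n",
-    "**Note**\n",
-    "1. The training corpus must be a two-dimensional array of string tokens, e.g. [['北', '京', '你', '好'], ['上', '海', '你', '好']]\n",
-    "2. The model has default values for its parameters; in Jupyter you can inspect them with `Word2Vec??` (the parameter names `size` and `iter` are those of gensim 3.x; gensim 4 renamed them to `vector_size` and `epochs`)\n",
-    "\n",
-    "\n",
-    "Here is a simple test example:\n",
-    "```\n",
-    "from gensim.models import Word2Vec\n",
-    "docs = [['30760', '157507'],\n",
-    "        ['289197', '63746'],\n",
-    "        ['36162', '168401'],\n",
-    "        ['50644', '36162']]\n",
-    "w2v = Word2Vec(docs, size=12, sg=1, window=2, seed=2020, workers=2, min_count=1, iter=1)\n",
-    "\n",
-    "# Look up the word vector for '30760'\n",
-    "w2v.wv['30760']\n",
-    "```\n",
-    "\n",
-    "For the detailed principles of skip-gram and CBOW, see the following blog posts:\n",
-    "- [word2vec principles (1): CBOW and Skip-Gram model basics](https://www.cnblogs.com/pinard/p/7160330.html)  \n",
-    "- [word2vec principles (2): the Hierarchical Softmax based model](https://www.cnblogs.com/pinard/p/7160330.html)  \n",
-    "- [word2vec principles (3): the Negative Sampling based model](https://www.cnblogs.com/pinard/p/7249903.html) "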
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:26.676173Z",
- "start_time": "2020-11-17T09:07:26.667926Z"
- }
- },
- "outputs": [],
- "source": [
- "def trian_item_word2vec(click_df, embed_size=64, save_name='item_w2v_emb.pkl', split_char=' '):\n",
- " click_df = click_df.sort_values('click_timestamp')\n",
-    "    # Article ids must be converted to strings before training\n",
-    "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
-    "    # Convert each user's clicks into a sentence\n",
- " docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
- " docs = docs['click_article_id'].values.tolist()\n",
- "\n",
-    "    # Set up logging so that training progress is visible\n",
-    "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
-    "\n",
-    "    # These parameters strongly affect the resulting vectors; negative sampling defaults to 5\n",
-    "    w2v = Word2Vec(docs, size=16, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1)\n",
-    "    \n",
-    "    # Save as a dictionary\n",
- " item_w2v_emb_dict = {k: w2v[k] for k in click_df['click_article_id']}\n",
- " pickle.dump(item_w2v_emb_dict, open(save_path + 'item_w2v_emb.pkl', 'wb'))\n",
- " \n",
- " return item_w2v_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:27.285690Z",
- "start_time": "2020-11-17T09:07:27.276646Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Look up an item's embedding via the dictionaries\n",
- "def get_embedding(save_path, all_click_df):\n",
- " if os.path.exists(save_path + 'item_content_emb.pkl'):\n",
- " item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))\n",
- " else:\n",
-    "        print('item_content_emb.pkl does not exist...')\n",
-    "        \n",
-    "    # The w2v embedding must be trained in advance\n",
- " if os.path.exists(save_path + 'item_w2v_emb.pkl'):\n",
- " item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))\n",
- " else:\n",
- " item_w2v_emb_dict = trian_item_word2vec(all_click_df)\n",
- " \n",
- " if os.path.exists(save_path + 'item_youtube_emb.pkl'):\n",
- " item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))\n",
- " else:\n",
-    "        print('item_youtube_emb.pkl does not exist...')\n",
- " \n",
- " if os.path.exists(save_path + 'user_youtube_emb.pkl'):\n",
- " user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
- " else:\n",
-    "        print('user_youtube_emb.pkl does not exist...')\n",
- " \n",
- " return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Reading article information"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:28.391797Z",
- "start_time": "2020-11-17T09:07:28.386650Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_article_info_df():\n",
- " article_info_df = pd.read_csv(data_path + 'articles.csv')\n",
- " article_info_df = reduce_mem(article_info_df)\n",
- " \n",
- " return article_info_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Reading the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:32.362045Z",
- "start_time": "2020-11-17T09:07:29.490413Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 23.34 Mb (69.4% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
-    "# The only difference between offline and online here is whether the validation set is empty\n",
- "click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:10.378966Z",
- "start_time": "2020-11-17T09:07:32.468580Z"
- }
- },
- "outputs": [],
- "source": [
- "click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)\n",
- "\n",
- "if click_val is not None:\n",
- " click_val_hist, click_val_last = click_val, val_ans\n",
- "else:\n",
- " click_val_hist, click_val_last = None, None\n",
- " \n",
- "click_tst_hist = click_tst"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
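-    "## Negative sampling of the training data\n",
-    "\n",
-    "Recall turns the data into triples of the form (user1, item1, label). The positive and negative samples turn out to be extremely imbalanced, so we can first downsample the negatives. Downsampling eases the positive/negative ratio problem and also reduces the cost of building ranking features. What should we watch out for when doing negative sampling?\n",
-    "\n",
-    "1. Downsample only the negative samples (if there is a good way to augment positive samples, that is worth considering too)\n",
-    "2. After negative sampling, make sure every user and every article still appears in the sampled data\n",
-    "3. The downsampling ratio can be set manually according to the actual situation\n",
-    "4. After negative sampling, update each user's recall list, because later features may use relative-position information.\n",
-    "\n",
-    "Negative sampling could also be done after the features are built; it is moved earlier here because building ranking features is too slow."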
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:36.096678Z",
- "start_time": "2020-11-17T09:11:36.090911Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Convert the recall dict into a DataFrame\n",
- "def recall_dict_2_df(recall_list_dict):\n",
- " df_row_list = [] # [user, item, score]\n",
- " for user, recall_list in tqdm(recall_list_dict.items()):\n",
- " for item, score in recall_list:\n",
- " df_row_list.append([user, item, score])\n",
- " \n",
- " col_names = ['user_id', 'sim_item', 'score']\n",
- " recall_list_df = pd.DataFrame(df_row_list, columns=col_names)\n",
- " \n",
- " return recall_list_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:37.668844Z",
- "start_time": "2020-11-17T09:11:37.659774Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Negative-sampling function; sample_rate controls the sampling ratio and is given a default value here\n",
- "def neg_sample_recall_data(recall_items_df, sample_rate=0.001):\n",
- " pos_data = recall_items_df[recall_items_df['label'] == 1]\n",
- " neg_data = recall_items_df[recall_items_df['label'] == 0]\n",
- " \n",
- " print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))\n",
- " \n",
-    "    # Per-group sampling function\n",
-    "    def neg_sample_func(group_df):\n",
-    "        neg_num = len(group_df)\n",
-    "        sample_num = max(int(neg_num * sample_rate), 1) # at least one sample\n",
-    "        sample_num = min(sample_num, 5) # at most 5 samples; adjust as needed\n",
-    "        return group_df.sample(n=sample_num, replace=True)\n",
-    "    \n",
-    "    # Sample negatives per user so that every user stays in the sampled data\n",
-    "    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)\n",
-    "    # Sample negatives per article so that every article stays in the sampled data\n",
-    "    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)\n",
-    "    \n",
-    "    # Merge the samples from the two passes\n",
-    "    neg_data_new = pd.concat([neg_data_user_sample, neg_data_item_sample])\n",
-    "    # The two passes are independent and may pick the same row twice, so deduplicate the merged data\n",
-    "    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')\n",
-    "    \n",
-    "    # Merge in the positive samples\n",
- " data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)\n",
- " \n",
- " return data_new"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:39.481715Z",
- "start_time": "2020-11-17T09:11:39.475144Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Label the recalled data\n",
-    "def get_rank_label_df(recall_list_df, label_df, is_test=False):\n",
-    "    # The test set has no labels; to keep the later code uniform, use a negative number as a placeholder\n",
- " if is_test:\n",
- " recall_list_df['label'] = -1\n",
- " return recall_list_df\n",
- " \n",
- " label_df = label_df.rename(columns={'click_article_id': 'sim_item'})\n",
- " recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \\\n",
- " how='left', on=['user_id', 'sim_item'])\n",
- " recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)\n",
- " del recall_list_df_['click_timestamp']\n",
- " \n",
- " return recall_list_df_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:41.555566Z",
- "start_time": "2020-11-17T09:11:41.546766Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,click_trn_last, click_val_last, recall_list_df):\n",
-    "    # Recall list for the training data\n",
-    "    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]\n",
-    "    # Label the training data\n",
-    "    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)\n",
-    "    # Negative-sample the training data\n",
-    "    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)\n",
-    "    \n",
-    "    if click_val_hist is not None:\n",
-    "        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]\n",
-    "        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)\n",
-    "        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)\n",
-    "    else:\n",
-    "        val_user_item_label_df = None\n",
-    "        \n",
-    "    # The test data needs no negative sampling; simply label every recalled item -1\n",
- " tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]\n",
- " tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)\n",
- " \n",
- " return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:23:35.357045Z",
- "start_time": "2020-11-17T17:23:12.378284Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:12<00:00, 20689.39it/s]\n"
- ]
- }
- ],
- "source": [
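-    "# Load the recall list\n",
-    "recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf') # only a single-channel recall result is used here; the multi-channel result can be used instead\n",
-    "# Convert the recall data into a DataFrame\n",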
- "recall_list_df = recall_dict_2_df(recall_list_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:29:04.598214Z",
- "start_time": "2020-11-17T17:23:40.001052Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "pos_data_num: 64190 neg_data_num: 1935810 pos/neg: 0.03315924600038227\n"
- ]
- }
- ],
- "source": [
-    "# Label the training/validation data and negative-sample it (this part takes a while)\n",
- "trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist, \n",
- " click_val_hist, \n",
- " click_tst_hist,\n",
- " click_trn_last, \n",
- " click_val_last, \n",
- " recall_list_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:23:11.642944Z",
- "start_time": "2020-11-17T17:23:08.475Z"
- },
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "trn_user_item_label_df.label"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Converting the recall data into dictionaries"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:36:22.800449Z",
- "start_time": "2020-11-17T17:36:22.794670Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Convert the final recall DataFrame into dictionaries for building ranking features\n",
- "def make_tuple_func(group_df):\n",
- " row_data = []\n",
- " for name, row_df in group_df.iterrows():\n",
- " row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))\n",
- " \n",
- " return row_data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:40:05.991819Z",
- "start_time": "2020-11-17T17:36:26.536429Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- "trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))\n",
- "\n",
- "if val_user_item_label_df is not None:\n",
- " val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- " val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))\n",
- "else:\n",
- " val_user_item_label_tuples_dict = None\n",
- " \n",
- "tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- "tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:59:53.141560Z",
- "start_time": "2020-11-17T07:59:53.133599Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "# Feature engineering"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
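-    "## Building features from user historical behavior\n",
-    "For each item recalled for each user, build features. The steps are:\n",
-    "* For each user, get the item_ids of the last N clicked items, \n",
-    "  * For each item recalled for that user, compute the sum (plus max, min, and mean) of its similarities to those last N clicked items, along with time-difference, similarity, word-count-difference, and user-similarity features"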
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 60,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T01:07:47.268035Z",
- "start_time": "2020-11-18T01:07:47.250449Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build history-based features from the data\n",
-    "def create_feature(users_id, recall_list, click_hist_df,  articles_info, articles_emb, user_emb=None, N=1):\n",
-    "    \"\"\"\n",
-    "    Build features from the user's historical behavior\n",
-    "    :param users_id: user ids\n",
-    "    :param recall_list: the candidate articles recalled for each user\n",
-    "    :param click_hist_df: the users' historical click records\n",
-    "    :param articles_info: article information\n",
-    "    :param articles_emb: article embedding vectors; item_content_emb, item_w2v_emb, or item_youtube_emb can be used\n",
-    "    :param user_emb: user embedding vectors (user_youtube_emb); optional, but if it is used, articles_emb must be item_youtube_emb so that the dimensions match\n",
-    "    :param N: the number of most recent clicks; many users in the testA log have only one historical click, so the default is 1 to avoid empty values\n",
-    "    \"\"\"\n",
-    "    \n",
-    "    # A 2-D list holding the results, converted to a DataFrame at the end\n",
- " all_user_feas = []\n",
- " i = 0\n",
- " for user_id in tqdm(users_id):\n",
-    "        # The user's last N clicks\n",
-    "        hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]\n",
-    "        \n",
-    "        # Iterate over the user's recall list\n",
-    "        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):\n",
-    "            # Creation time and word count of the candidate article\n",
-    "            a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]\n",
-    "            a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]\n",
-    "            single_user_fea = [user_id, article_id]\n",
-    "            # Similarities to the last clicked items (their sum, max, min, and mean are computed below)\n",
-    "            sim_fea = []\n",
-    "            time_fea = []\n",
-    "            word_fea = []\n",
-    "            # Iterate over the user's last N clicked articles\n",
- " for hist_item in hist_user_items:\n",
- " b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]\n",
- " b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]\n",
- " \n",
- " sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))\n",
- " time_fea.append(abs(a_create_time-b_create_time))\n",
- " word_fea.append(abs(a_words_count-b_words_count))\n",
- " \n",
-    "            single_user_fea.extend(sim_fea)      # similarity features\n",
-    "            single_user_fea.extend(time_fea)    # time-difference features\n",
-    "            single_user_fea.extend(word_fea)    # word-count-difference features\n",
-    "            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])  # statistics of the similarities\n",
-    "            \n",
-    "            if user_emb:  # if user vectors are available, compute the similarity between the recalled article and the user\n",
- " single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))\n",
- " \n",
- " single_user_fea.extend([score, rank, label]) \n",
-    "            # Append to the overall table\n",
- " all_user_feas.append(single_user_fea)\n",
- " \n",
-    "    # Define the column names\n",
- " id_cols = ['user_id', 'click_article_id']\n",
- " sim_cols = ['sim' + str(i) for i in range(N)]\n",
- " time_cols = ['time_diff' + str(i) for i in range(N)]\n",
- " word_cols = ['word_diff' + str(i) for i in range(N)]\n",
- " sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']\n",
- " user_item_sim_cols = ['user_item_sim'] if user_emb else []\n",
- " user_score_rank_label = ['score', 'rank', 'label']\n",
- " cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label\n",
- " \n",
-    "    # Convert to a DataFrame\n",
- " df = pd.DataFrame( all_user_feas, columns=cols)\n",
- " \n",
- " return df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T01:08:17.531694Z",
- "start_time": "2020-11-18T01:08:10.754702Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
- "article_info_df = get_article_info_df()\n",
- "all_click = click_trn.append(click_tst)\n",
- "item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:06:22.709350Z",
- "start_time": "2020-11-18T01:08:39.923811Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 200000/200000 [50:16<00:00, 66.31it/s] \n",
- "100%|██████████| 50000/50000 [1:07:21<00:00, 12.37it/s]\n"
- ]
- }
- ],
- "source": [
-    "# Build the features of the recalled articles for the training, validation and test data\n",
- "trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \\\n",
- " click_trn_hist, article_info_df, item_content_emb_dict)\n",
- "\n",
- "if val_user_item_label_tuples_dict is not None:\n",
- " val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \\\n",
- " click_val_hist, article_info_df, item_content_emb_dict)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \\\n",
- " click_tst_hist, article_info_df, item_content_emb_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:13:58.573422Z",
- "start_time": "2020-11-18T03:13:40.157228Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Save a copy so this does not have to be recomputed every time; each run takes a long while\n",
- "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
- "\n",
- "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:14:22.838154Z",
- "start_time": "2020-11-18T03:14:22.828212Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
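-    "## User and article features\n",
-    "### User-related features\n",
-    "This is where feature engineering proper begins: we join in the existing features and also build new ones. Let us list the existing features and the ones that can be constructed:\n",
-    "1. Features of the article itself: word count, creation time and article embedding (in the articles table)\n",
-    "2. Features of the user's click environment, i.e. the device features (these are in df)\n",
-    "3. Further features that can be constructed for users and articles:\n",
-    "    * Features reflecting user activity, built from the number of articles a user clicked and the click times\n",
-    "    * Features reflecting article popularity, built from the number of times an article was clicked and the click times\n",
-    "    * Time statistics of the user: statistics (e.g. the mean) over the click times and the creation times of the articles in the user's click history, which reflect the user's preference regarding article freshness\n",
-    "    * Topic preference of the user: collect the topics of the articles the user has clicked, then check whether the current article belongs to a topic the user has already clicked\n",
-    "    * Word-count preference of the user: the mean word count of the articles the user has clicked"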
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:16:37.637495Z",
- "start_time": "2020-11-14T03:16:37.618229Z"
- }
- },
- "outputs": [],
- "source": [
- "click_tst.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:11.675550Z",
- "start_time": "2020-11-17T02:09:10.265134Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Load the article features\n",
- "articles = pd.read_csv(data_path+'articles.csv')\n",
- "articles = reduce_mem(articles)\n",
- "\n",
-    "# Log data: all the click logs loaded above\n",
-    "all_data = click_trn\n",
-    "if click_val is not None:\n",
-    "    all_data = all_data.append(click_val)\n",
-    "all_data = all_data.append(click_tst)\n",
-    "all_data = reduce_mem(all_data)\n",
- "\n",
-    "# Join in the article information\n",
- "all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:17:12.256244Z",
- "start_time": "2020-11-14T03:17:12.250452Z"
- }
- },
- "outputs": [],
- "source": [
- "all_data.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
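-    "#### Analysing click times and click counts to distinguish user activity\n",
-    "If the time gaps between a user's clicks are small and the user has clicked many articles, we regard that user as an active user. There are of course many ways to measure user activity; here we provide just one. We write a function that produces a feature measuring user activity, with the following logic:\n",
-    "1. Group by user_id; for each user, compute the number of clicks and the mean time gap between consecutive clicks\n",
-    "2. Take the reciprocal of the click count, min-max normalise both it and the mean time gap, then add the two together; the smaller the value, the more active the user\n",
-    "3. Note that a user with only one click has no time gap, so the mean gap would be empty; in that case the feature should be filled with a large number to set these users apart (note that the code below fills in 1)\n",
-    "\n",
-    "In other words, the measure takes the reciprocal of the click count and normalises it, normalises the click time gap, and adds the two; the smaller the value, the more clicks and the shorter the gaps. "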
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:28:55.336058Z",
- "start_time": "2020-11-17T02:28:55.324332Z"
- }
- },
- "outputs": [],
- "source": [
- " def active_level(all_data, cols):\n",
- " \"\"\"\n",
-    "     Build the feature that distinguishes user activity\n",
-    "     :param all_data: dataset\n",
-    "     :param cols: feature columns to use\n",
-    "     \"\"\"\n",
-    "     data = all_data[cols].sort_values(['user_id', 'click_timestamp'])\n",
- " user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].\\\n",
- " agg({'click_article_id':np.size, 'click_timestamp': {list}}).values, columns=['user_id', 'click_size', 'click_timestamp'])\n",
- " \n",
-    "     # Mean of the time gaps between consecutive clicks\n",
- " def time_diff_mean(l):\n",
- " if len(l) == 1:\n",
- " return 1\n",
- " else:\n",
- " return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
- " \n",
- " user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
- " \n",
-    "     # Take the reciprocal of the click count\n",
- " user_act['click_size'] = 1 / user_act['click_size']\n",
- " \n",
-    "     # Min-max normalise both\n",
- " user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())\n",
- " user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min()) \n",
- " user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']\n",
- " \n",
- " user_act['user_id'] = user_act['user_id'].astype('int')\n",
- " del user_act['click_timestamp']\n",
- " \n",
- " return user_act"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:30:12.696060Z",
- "start_time": "2020-11-17T02:29:01.523837Z"
- }
- },
- "outputs": [],
- "source": [
- "user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:28:53.996742Z",
- "start_time": "2020-11-17T02:09:18.374Z"
- }
- },
- "outputs": [],
- "source": [
- "user_act_fea.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
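-    "#### Analysing click times and click counts to measure article popularity\n",
-    "Same idea as above: if an article is clicked many times within short time gaps, it is fairly popular. The logic is essentially the same as above, except that we group by the clicked article:\n",
-    "1. Group by article; for the users who clicked each article, compute the time gaps between clicks\n",
-    "2. Take the reciprocal of the number of users, normalise it and the time gap, then add them to obtain the popularity feature; the smaller the value, the more clicks and the shorter the gaps, i.e. the hotter the article\n",
-    "\n",
-    "Of course, this is only one way of judging article popularity; feel free to brainstorm others"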
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:41:26.398567Z",
- "start_time": "2020-11-17T02:41:26.386668Z"
- }
- },
- "outputs": [],
- "source": [
- " def hot_level(all_data, cols):\n",
- " \"\"\"\n",
-    "     Build the feature that measures article popularity\n",
-    "     :param all_data: dataset\n",
-    "     :param cols: feature columns to use\n",
-    "     \"\"\"\n",
-    "     data = all_data[cols].sort_values(['click_article_id', 'click_timestamp'])\n",
- " article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].\\\n",
- " agg({'user_id':np.size, 'click_timestamp': {list}}).values, columns=['click_article_id', 'user_num', 'click_timestamp'])\n",
- " \n",
-    "     # Mean of the time gaps between consecutive clicks on the article\n",
- " def time_diff_mean(l):\n",
- " if len(l) == 1:\n",
- " return 1\n",
- " else:\n",
- " return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
- " \n",
- " article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
- " \n",
-    "     # Take the reciprocal of the click count\n",
- " article_hot['user_num'] = 1 / article_hot['user_num']\n",
- " \n",
-    "     # Min-max normalise both\n",
- " article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())\n",
- " article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min()) \n",
- " article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']\n",
- " \n",
- " article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')\n",
- " \n",
- " del article_hot['click_timestamp']\n",
- " \n",
- " return article_hot"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:41:44.635900Z",
- "start_time": "2020-11-17T02:41:31.473032Z"
- }
- },
- "outputs": [],
- "source": [
- "article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:19:54.775290Z",
- "start_time": "2020-11-14T03:19:54.763699Z"
- }
- },
- "outputs": [],
- "source": [
- "article_hot_fea.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
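-    "#### User habits\n",
-    "Here we build, from the original log table, a DataFrame similar to the articles one that stores user-specific information, mainly click habits, preferences and the like\n",
-    "* Device habit of the user: take the most frequently used device (the mode)\n",
-    "* Time habit of the user: statistics over the times of the articles the user has clicked (ideally the hour would be extracted from the timestamp to see at what time of day the user tends to click, but for now we use the transformed time and take the mean)\n",
-    "* Topic preference of the user: judge from the topics of the clicked articles which topics the user leans towards; this would best be multi-hot encoded, but we first try a simpler approach\n",
-    "* Word-count feature of the user: the word-count habit of the articles the user likes\n",
-    "\n",
-    "All of these are obtained by grouping by user and computing statistics"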
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### Device habit of the user"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T04:22:48.877978Z",
- "start_time": "2020-11-17T04:22:48.872049Z"
- }
- },
- "outputs": [],
- "source": [
- "def device_fea(all_data, cols):\n",
- " \"\"\"\n",
-    "    Build the user's device features\n",
-    "    :param all_data: dataset\n",
-    "    :param cols: feature columns to use\n",
- " \"\"\"\n",
- " user_device_info = all_data[cols]\n",
- " \n",
-    "    # Represent each user's device information by the mode\n",
- " user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()\n",
- " \n",
- " return user_device_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T05:27:10.897473Z",
- "start_time": "2020-11-17T04:49:33.214865Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Device features (this takes quite a while)\n",
- "device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']\n",
- "user_device_info = device_fea(all_data, device_cols)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T04:20:39.765842Z",
- "start_time": "2020-11-14T04:20:39.747087Z"
- }
- },
- "outputs": [],
- "source": [
- "user_device_info.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### Time habit of the user"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:11:50.889905Z",
- "start_time": "2020-11-17T06:11:50.882653Z"
- }
- },
- "outputs": [],
- "source": [
- "def user_time_hob_fea(all_data, cols):\n",
- " \"\"\"\n",
-    "    Build the user's time-habit features\n",
-    "    :param all_data: dataset\n",
-    "    :param cols: feature columns to use\n",
-    "    \"\"\"\n",
-    "    user_time_hob_info = all_data[cols].copy()\n",
-    "    \n",
-    "    # First normalise the timestamps\n",
- " mm = MinMaxScaler()\n",
- " user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])\n",
- " user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])\n",
- "\n",
- " user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()\n",
- " \n",
- " user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)\n",
- " return user_time_hob_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:31:51.646110Z",
- "start_time": "2020-11-17T06:31:51.171431Z"
- }
- },
- "outputs": [],
- "source": [
- "user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts']\n",
- "user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
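-    "#### Topic preference of the user\n",
-    "Here we first collect the topics of the articles each user has clicked into a list. Later, when everything is merged, we build a separate feature from it: 1 if the article's topic is in this list, otherwise 0."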
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:31:56.571088Z",
- "start_time": "2020-11-17T06:31:56.565304Z"
- }
- },
- "outputs": [],
- "source": [
- "def user_cat_hob_fea(all_data, cols):\n",
- " \"\"\"\n",
-    "    Topic preference of the user\n",
-    "    :param all_data: dataset\n",
-    "    :param cols: feature columns to use\n",
- " \"\"\"\n",
- " user_category_hob_info = all_data[cols]\n",
- " user_category_hob_info = user_category_hob_info.groupby('user_id').agg({list}).reset_index()\n",
- " \n",
- " user_cat_hob_info = pd.DataFrame()\n",
- " user_cat_hob_info['user_id'] = user_category_hob_info['user_id']\n",
- " user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']\n",
- " \n",
- " return user_cat_hob_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:32:55.150800Z",
- "start_time": "2020-11-17T06:32:00.740046Z"
- }
- },
- "outputs": [],
- "source": [
- "user_category_hob_cols = ['user_id', 'category_id']\n",
- "user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### Word-count preference of the user"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:12.988460Z",
- "start_time": "2020-11-17T06:48:12.547000Z"
- }
- },
- "outputs": [],
- "source": [
- "user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
- "user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### Merging and saving the user features"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:18.289591Z",
- "start_time": "2020-11-17T06:48:17.084408Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Merge all the tables\n",
- "user_info = pd.merge(user_act_fea, user_device_info, on='user_id')\n",
- "user_info = user_info.merge(user_time_hob_info, on='user_id')\n",
- "user_info = user_info.merge(user_cat_hob_info, on='user_id')\n",
- "user_info = user_info.merge(user_wcou_info, on='user_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:26.907785Z",
- "start_time": "2020-11-17T06:48:21.457597Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Saved so that the user features can be loaded directly later\n",
- "user_info.to_csv(save_path + 'user_info.csv', index=False) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
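-    "### Loading the user features directly\n",
-    "If the user feature engineering above has already been done, the features can simply be loaded from disk"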
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 69,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:15:49.502826Z",
- "start_time": "2020-11-18T03:15:48.062243Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Load the user information directly\n",
- "user_info = pd.read_csv(save_path + 'user_info.csv')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 70,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:15:56.899635Z",
- "start_time": "2020-11-18T03:15:53.701818Z"
- }
- },
- "outputs": [],
- "source": [
- "if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):\n",
- " trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
- " \n",
- "if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):\n",
- " tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
- "\n",
- "if os.path.exists(save_path + 'val_user_item_feats_df.csv'):\n",
- " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
- "else:\n",
- " val_user_item_feats_df = None"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 71,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:02.739197Z",
- "start_time": "2020-11-18T03:16:01.725028Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Join in the user features\n",
-    "# The following is for offline validation\n",
- "trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id',how='left')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 72,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:06.989877Z",
- "start_time": "2020-11-18T03:16:06.983327Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',\n",
- " 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',\n",
- " 'click_size', 'time_diff_mean', 'active_level', 'click_environment',\n",
- " 'click_deviceGroup', 'click_os', 'click_country', 'click_region',\n",
- " 'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',\n",
- " 'words_hbo'],\n",
- " dtype='object')"
- ]
- },
- "execution_count": 72,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_user_item_feats_df.columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:13:36.071236Z",
- "start_time": "2020-11-14T03:13:36.050188Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Loading the article features directly"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 73,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:12.793070Z",
- "start_time": "2020-11-18T03:16:12.425380Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
- "articles = pd.read_csv(data_path+'articles.csv')\n",
- "articles = reduce_mem(articles)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 74,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:18.118507Z",
- "start_time": "2020-11-18T03:16:16.344338Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Join in the article features\n",
- "trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- "\n",
- "tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Whether the recalled article's topic is among the user's interests"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 76,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:17:40.251797Z",
- "start_time": "2020-11-18T03:16:28.130012Z"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
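+    "# Building features and labels: turning the task into a supervised learning problem\n",
+    "Let us first go over which features in the raw data can be used directly:\n",
+    "1. Features of the article itself: category_id is the article's category; created_at_ts is when the article was created, which relates to its timeliness; words_count is the number of words. Readers are generally less inclined to click on very long articles, although some people do prefer long reads.\n",
+    "2. Content embedding features of the article. These were used during recall; here they are optional, and other kinds of embedding such as W2V can be tried as well\n",
+    "3. Device information of the user\n",
+    "\n",
+    "The directly usable features above can simply be joined in by article_id or user_id once feature engineering is done. Before that, however, we need to build some features from the recall results and create labels, so as to form a supervised learning dataset.  \n",
+    "The idea for constructing the supervised dataset: the recall results give us a dictionary of the form {user_id: [list of articles likely to be clicked]}. For each user and each candidate article we can then build one row. For example, if the recall list of user1 is {user1: [item1, item2, item3]}, we obtain three rows (user1, item1), (user1, item2), (user1, item3); these are the first two feature columns of the supervised dataset.  \n",
+    "\n",
+    "The idea for constructing features is as follows. The article a user clicks is strongly related to the articles the user clicked in the past, e.g. same topic, similar content and so on. So a very important family of features **must draw on the user's historical clicks**. We already have a two-column dataset of users and their candidate articles, and our goal is to predict the last clicked article. A natural idea is to relate each candidate to the user's last few clicks: this takes the click history into account while staying close in time to the last click, since news is highly time-sensitive. A user's last click is often strongly related to the few clicks preceding it. For each candidate article we can therefore build the following features related to the last few clicks:\n",
+    "1. Similarity between the candidate item and the last few clicks (inner product of embeddings) --- this ties directly into the user's history\n",
+    "2. Statistics of the similarities between the candidate item and the last few clicks --- statistics dampen fluctuations and outliers\n",
+    "3. Word-count difference between the candidate item and the last few clicked articles --- word counts can reveal user preference\n",
+    "4. Creation-time difference between the candidate item and the last few clicked articles --- time differences show the user's preference regarding article freshness \n",
+    "\n",
+    "\n",
+    "One more thing to consider:\n",
+    "**5. If YouTube recall was used, we can also build a similarity feature between the user and the candidate item**\n",
+    "\n",
+    "\n",
+    "\n",
+    "Of course, the above is only one way of doing feature engineering based on user history; feel free to brainstorm and try other features. Below we implement these features, following this logic:\n",
+    "1. First obtain each user's last click and historical clicks from the log dataset\n",
+    "2. Build features from the user's historical behaviour, using the historical click table, the final recall list, the article information table and the embedding vectors\n",
+    "3. Create labels to form the final supervised learning dataset"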
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "# Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:00.341709Z",
+ "start_time": "2020-11-17T09:06:58.723900Z"
+ },
+ "cell_style": "center",
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import pickle\n",
+ "from tqdm import tqdm\n",
+ "import gc, os\n",
+ "import logging\n",
+ "import time\n",
+ "import lightgbm as lgb\n",
+ "from gensim.models import Word2Vec\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "# Memory-saving helper for DataFrames"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:02.411005Z",
+ "start_time": "2020-11-17T09:07:02.397830Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# A helper function to save memory\n",
+    "# Downcast numeric columns to reduce memory usage\n",
+ "def reduce_mem(df):\n",
+ " starttime = time.time()\n",
+ " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
+ " start_mem = df.memory_usage().sum() / 1024**2\n",
+ " for col in df.columns:\n",
+ " col_type = df[col].dtypes\n",
+ " if col_type in numerics:\n",
+ " c_min = df[col].min()\n",
+ " c_max = df[col].max()\n",
+ " if pd.isnull(c_min) or pd.isnull(c_max):\n",
+ " continue\n",
+ " if str(col_type)[:3] == 'int':\n",
+ " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
+ " df[col] = df[col].astype(np.int8)\n",
+ " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
+ " df[col] = df[col].astype(np.int16)\n",
+ " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
+ " df[col] = df[col].astype(np.int32)\n",
+ " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
+ " df[col] = df[col].astype(np.int64)\n",
+ " else:\n",
+ " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
+ " df[col] = df[col].astype(np.float16)\n",
+ " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
+ " df[col] = df[col].astype(np.float32)\n",
+ " else:\n",
+ " df[col] = df[col].astype(np.float64)\n",
+ " end_mem = df.memory_usage().sum() / 1024**2\n",
+ " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
+ " 100*(start_mem-end_mem)/start_mem,\n",
+ " (time.time()-starttime)/60))\n",
+ " return df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:05.031436Z",
+ "start_time": "2020-11-17T09:07:05.026822Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
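+    "# Loading the data\n",
+    "\n",
+    "## Splitting into training and validation sets\n",
+    "\n",
+    "We split off a validation set so that model parameters can be evaluated offline. To mimic the test set as closely as possible, we sample some users from the training set and use all of their records as the validation set. Splitting train and validation up front also spreads out the cost of building ranking features; building them for the whole dataset in one go may take quite long."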
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:07.230308Z",
+ "start_time": "2020-11-17T09:07:07.221081Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# all_click_df is the training set\n",
+    "# sample_user_nums: number of users sampled as the validation set\n",
+ "def trn_val_split(all_click_df, sample_user_nums):\n",
+ " all_click = all_click_df\n",
+ " all_user_ids = all_click.user_id.unique()\n",
+ " \n",
+    "    # replace=True samples with replacement; replace=False samples without replacement\n",
+ " sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False) \n",
+ " \n",
+ " click_val = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ " click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]\n",
+ " \n",
+    "    # Take the last click of each validation user as the answer\n",
+ " click_val = click_val.sort_values(['user_id', 'click_timestamp'])\n",
+ " val_ans = click_val.groupby('user_id').tail(1)\n",
+ " \n",
+ " click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)\n",
+ " \n",
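+    "    # Remove users in val_ans that have only one click. If such a user's single click goes into the answers,\n",
+    "    # no click data remains for that user, which causes a user cold-start problem and complicates validation\n",
+    "    val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())] # make sure users in the answers still appear in the validation set\n",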
+ " click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]\n",
+ " \n",
+ " return click_trn, click_val, val_ans"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Getting historical clicks and the last click"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:19.202550Z",
+ "start_time": "2020-11-17T09:07:19.195766Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Get the historical clicks and the last click from the given data\n",
+ "def get_hist_and_last_click(all_click):\n",
+ " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
+ " click_last_df = all_click.groupby('user_id').tail(1)\n",
+ "\n",
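+    "    # If a user has only one click, hist would be empty and the user would be invisible during training, so in that case we allow the leak and keep that click\n",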
+ " def hist_func(user_df):\n",
+ " if len(user_df) == 1:\n",
+ " return user_df\n",
+ " else:\n",
+ " return user_df[:-1]\n",
+ "\n",
+ " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
+ "\n",
+ " return click_hist_df, click_last_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Loading the training, validation and test sets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:21.181211Z",
+ "start_time": "2020-11-17T09:07:21.171338Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_trn_val_tst_data(data_path, offline=True):\n",
+ " if offline:\n",
+    "        click_trn_data = pd.read_csv(data_path+'train_click_log.csv')  # user click log of the training set\n",
+ " click_trn_data = reduce_mem(click_trn_data)\n",
+ " click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)\n",
+ " else:\n",
+ " click_trn = pd.read_csv(data_path+'train_click_log.csv')\n",
+ " click_trn = reduce_mem(click_trn)\n",
+ " click_val = None\n",
+ " val_ans = None\n",
+ " \n",
+ " click_tst = pd.read_csv(data_path+'testA_click_log.csv')\n",
+ " \n",
+ " return click_trn, click_val, click_tst, val_ans"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Loading the recall list"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:23.210604Z",
+ "start_time": "2020-11-17T09:07:23.203652Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Return either the merged multi-channel recall list or a single-channel recall list\n",
+ "def get_recall_list(save_path, single_recall_model=None, multi_recall=False):\n",
+ " if multi_recall:\n",
+ " return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))\n",
+ " \n",
+ " if single_recall_model == 'i2i_itemcf':\n",
+ " return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))\n",
+ " elif single_recall_model == 'i2i_emb_itemcf':\n",
+ " return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))\n",
+ " elif single_recall_model == 'user_cf':\n",
+ " return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))\n",
+ " elif single_recall_model == 'youtubednn':\n",
+ " return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Loading the embeddings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
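+    "##### Training Word2Vec and using gensim\n",
+    "\n",
+    "The main idea of Word2Vec is that the context of a word expresses its meaning well; word vectors are learned in an unsupervised way. Word2Vec has two classic models: skip-gram and CBOW.\n",
+    "\n",
+    "- skip-gram: predict the surrounding words given the centre word.\n",
+    "- cbow: predict the centre word given the surrounding words.\n",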
+ "![image-20201106225233086](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)\n",
+ "\n",
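+    "When training word2vec with gensim, a few parameters matter most:\n",
+    "- size: dimensionality of the word vectors (renamed vector_size in gensim 4.0+).\n",
+    "- window: how far away a context word may be and still be related to the target word.\n",
+    "- sg: 0 selects the CBOW model, 1 selects Skip-Gram.\n",
+    "- workers: number of threads used for training\n",
+    "- min_count: minimum frequency; words that occur fewer times than this are ignored\n",
+    "- iter: number of passes over the whole dataset during training (renamed epochs in gensim 4.0+)\n",
+    "\n",
+    "**Note**\n",
+    "1. The training corpus must be a two-dimensional list of strings, e.g. [['北', '京', '你', '好'], ['上', '海', '你', '好']]\n",
+    "2. The model has a number of default values; you can view them in Jupyter with `Word2Vec??`\n",
+    "\n",
+    "\n",
+    "A simple test example:\n",
+    "```\n",
+    "from gensim.models import Word2Vec\n",
+    "docs = [['30760', '157507'],\n",
+    "        ['289197', '63746'],\n",
+    "        ['36162', '168401'],\n",
+    "        ['50644', '36162']]\n",
+    "w2v = Word2Vec(docs, size=12, sg=1, window=2, seed=2020, workers=2, min_count=1, iter=1)\n",
+    "\n",
+    "# Look up the word vector for '30760'\n",
+    "w2v.wv['30760']\n",
+    "```\n",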
+ "\n",
+    "The detailed principles of skip-gram and CBOW are covered in the following blog posts:\n",
+ "- [word2vec原理(一) CBOW与Skip-Gram模型基础](https://www.cnblogs.com/pinard/p/7160330.html) \n",
+ "- [word2vec原理(二) 基于Hierarchical Softmax的模型](https://www.cnblogs.com/pinard/p/7160330.html) \n",
+ "- [word2vec原理(三) 基于Negative Sampling的模型](https://www.cnblogs.com/pinard/p/7249903.html) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:26.676173Z",
+ "start_time": "2020-11-17T09:07:26.667926Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
+    "    click_df = click_df.sort_values('click_timestamp')\n",
+    "    # article ids must be converted to strings before training\n",
+    "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
+    "    # turn each user's click sequence into one sentence\n",
+    "    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
+    "    docs = docs['click_article_id'].values.tolist()\n",
+    "\n",
+    "    # log progress so that training can be monitored\n",
+    "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
+    "\n",
+    "    # these parameters strongly affect the learned vectors; negative sampling defaults to 5\n",
+    "    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1)\n",
+    "    \n",
+    "    # save as a dict: article id (str) -> embedding vector\n",
+    "    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id']}\n",
+    "    pickle.dump(item_w2v_emb_dict, open(save_path + save_name, 'wb'))\n",
+    "    \n",
+    "    return item_w2v_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:27.285690Z",
+ "start_time": "2020-11-17T09:07:27.276646Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# the embedding of an item can be looked up in the returned dicts\n",
+    "def get_embedding(save_path, all_click_df):\n",
+    "    if os.path.exists(save_path + 'item_content_emb.pkl'):\n",
+    "        item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))\n",
+    "    else:\n",
+    "        item_content_emb_dict = None  # avoids a NameError at the return below\n",
+    "        print('item_content_emb.pkl does not exist...')\n",
+    "        \n",
+    "    # the w2v embedding is loaded if it was trained in advance, otherwise it is trained here\n",
+    "    if os.path.exists(save_path + 'item_w2v_emb.pkl'):\n",
+    "        item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))\n",
+    "    else:\n",
+    "        item_w2v_emb_dict = trian_item_word2vec(all_click_df)\n",
+    "        \n",
+    "    if os.path.exists(save_path + 'item_youtube_emb.pkl'):\n",
+    "        item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))\n",
+    "    else:\n",
+    "        item_youtube_emb_dict = None\n",
+    "        print('item_youtube_emb.pkl does not exist...')\n",
+    "    \n",
+    "    if os.path.exists(save_path + 'user_youtube_emb.pkl'):\n",
+    "        user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
+    "    else:\n",
+    "        user_youtube_emb_dict = None\n",
+    "        print('user_youtube_emb.pkl does not exist...')\n",
+    "    \n",
+    "    return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Loading the article information"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:28.391797Z",
+ "start_time": "2020-11-17T09:07:28.386650Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_article_info_df():\n",
+ " article_info_df = pd.read_csv(data_path + 'articles.csv')\n",
+ " article_info_df = reduce_mem(article_info_df)\n",
+ " \n",
+ " return article_info_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Loading the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:32.362045Z",
+ "start_time": "2020-11-17T09:07:29.490413Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 23.34 Mb (69.4% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+    "# the only difference between offline and online mode is whether the validation set is empty\n",
+ "click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:10.378966Z",
+ "start_time": "2020-11-17T09:07:32.468580Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)\n",
+ "\n",
+ "if click_val is not None:\n",
+ " click_val_hist, click_val_last = click_val, val_ans\n",
+ "else:\n",
+ " click_val_hist, click_val_last = None, None\n",
+ " \n",
+ "click_tst_hist = click_tst"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
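+    "## Negative sampling of the training data\n",
+    "\n",
+    "Recall turns the data into triples of the form (user1, item1, label). The positive and negative samples turn out to be extremely imbalanced, so we first downsample the negatives. Downsampling eases the class imbalance and also reduces the cost of building ranking features. A few points need attention when doing negative sampling:\n",
+    "\n",
+    "1. Only the negative samples are downsampled (a good way of augmenting the positive samples could also be considered).\n",
+    "2. After negative sampling, every user and every article must still appear in the sampled data.\n",
+    "3. The downsampling ratio can be set manually according to the situation.\n",
+    "4. After negative sampling, update each user's list of recalled articles, because later features may use relative position information.\n",
+    "\n",
+    "Negative sampling could also be done after the features are built; it is moved earlier here because building ranking features is very slow."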
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:36.096678Z",
+ "start_time": "2020-11-17T09:11:36.090911Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# convert the recall lists to a DataFrame\n",
+ "def recall_dict_2_df(recall_list_dict):\n",
+ " df_row_list = [] # [user, item, score]\n",
+ " for user, recall_list in tqdm(recall_list_dict.items()):\n",
+ " for item, score in recall_list:\n",
+ " df_row_list.append([user, item, score])\n",
+ " \n",
+ " col_names = ['user_id', 'sim_item', 'score']\n",
+ " recall_list_df = pd.DataFrame(df_row_list, columns=col_names)\n",
+ " \n",
+ " return recall_list_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:37.668844Z",
+ "start_time": "2020-11-17T09:11:37.659774Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# negative sampling; sample_rate controls the sampling ratio and has a default value\n",
+    "def neg_sample_recall_data(recall_items_df, sample_rate=0.001):\n",
+    "    pos_data = recall_items_df[recall_items_df['label'] == 1]\n",
+    "    neg_data = recall_items_df[recall_items_df['label'] == 0]\n",
+    "    \n",
+    "    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))\n",
+    "    \n",
+    "    # per-group sampling function\n",
+    "    def neg_sample_func(group_df):\n",
+    "        neg_num = len(group_df)\n",
+    "        sample_num = max(int(neg_num * sample_rate), 1) # keep at least one sample\n",
+    "        sample_num = min(sample_num, 5) # keep at most 5 samples; adjust as needed\n",
+    "        return group_df.sample(n=sample_num, replace=True)\n",
+    "    \n",
+    "    # sample negatives per user, so that every user stays in the sampled data\n",
+    "    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)\n",
+    "    # sample negatives per article, so that every article stays in the sampled data\n",
+    "    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)\n",
+    "    \n",
+    "    # merge the two samples\n",
+    "    neg_data_new = pd.concat([neg_data_user_sample, neg_data_item_sample])\n",
+    "    # the two samplings are independent and may pick the same row twice, so deduplicate the merged data\n",
+    "    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')\n",
+    "    \n",
+    "    # add the positive samples back\n",
+    "    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)\n",
+    "    \n",
+    "    return data_new"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:39.481715Z",
+ "start_time": "2020-11-17T09:11:39.475144Z"
+ }
+ },
+ "outputs": [],
+ "source": [
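+    "# label the recall data\n",
+    "def get_rank_label_df(recall_list_df, label_df, is_test=False):\n",
+    "    # the test set has no labels; use a negative placeholder so that later code can treat all sets uniformly\n",
+    "    if is_test:\n",
+    "        recall_list_df = recall_list_df.copy()  # work on a copy instead of a slice of the caller's DataFrame\n",
+    "        recall_list_df['label'] = -1\n",
+    "        return recall_list_df\n",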
+ " \n",
+ " label_df = label_df.rename(columns={'click_article_id': 'sim_item'})\n",
+ " recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \\\n",
+ " how='left', on=['user_id', 'sim_item'])\n",
+ " recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)\n",
+ " del recall_list_df_['click_timestamp']\n",
+ " \n",
+ " return recall_list_df_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:41.555566Z",
+ "start_time": "2020-11-17T09:11:41.546766Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,click_trn_last, click_val_last, recall_list_df):\n",
+    "    # recall list of the training users\n",
+    "    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]\n",
+    "    # label the training data\n",
+    "    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)\n",
+    "    # negative sampling of the training data\n",
+    "    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)\n",
+    "    \n",
+    "    # check the argument rather than the global click_val\n",
+    "    if click_val_hist is not None:\n",
+    "        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]\n",
+    "        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)\n",
+    "        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)\n",
+    "    else:\n",
+    "        val_user_item_label_df = None\n",
+    "        \n",
+    "    # the test data needs no negative sampling; every recalled article simply gets the label -1\n",
+    "    tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]\n",
+    "    tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)\n",
+    "    \n",
+    "    return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:23:35.357045Z",
+ "start_time": "2020-11-17T17:23:12.378284Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:12<00:00, 20689.39it/s]\n"
+ ]
+ }
+ ],
+ "source": [
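+    "# load the recall lists\n",
+    "recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf') # only a single recall channel is used here; the merged multi-channel result can be used as well\n",
+    "# convert the recall data to a DataFrame\n",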
+ "recall_list_df = recall_dict_2_df(recall_list_dict)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:29:04.598214Z",
+ "start_time": "2020-11-17T17:23:40.001052Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "pos_data_num: 64190 neg_data_num: 1935810 pos/neg: 0.03315924600038227\n"
+ ]
+ }
+ ],
+ "source": [
+    "# label the training and validation data and apply negative sampling (this step takes a while)\n",
+ "trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist, \n",
+ " click_val_hist, \n",
+ " click_tst_hist,\n",
+ " click_trn_last, \n",
+ " click_val_last, \n",
+ " recall_list_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:23:11.642944Z",
+ "start_time": "2020-11-17T17:23:08.475Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_label_df.label"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Converting the recall data to dictionaries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:36:22.800449Z",
+ "start_time": "2020-11-17T17:36:22.794670Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# convert the final recall DataFrame to a dict that is used to build the ranking features\n",
+ "def make_tuple_func(group_df):\n",
+ " row_data = []\n",
+ " for name, row_df in group_df.iterrows():\n",
+ " row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))\n",
+ " \n",
+ " return row_data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:40:05.991819Z",
+ "start_time": "2020-11-17T17:36:26.536429Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ "trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))\n",
+ "\n",
+ "if val_user_item_label_df is not None:\n",
+ " val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ " val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))\n",
+ "else:\n",
+ " val_user_item_label_tuples_dict = None\n",
+ " \n",
+ "tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ "tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:59:53.141560Z",
+ "start_time": "2020-11-17T07:59:53.133599Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "# Feature engineering"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
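+    "## Features based on the user's click history\n",
+    "Features are built for every article recalled for every user. The steps are:\n",
+    "* For each user, get the item_id of the last N clicked articles, \n",
+    "    * For each article recalled for that user, compute its similarity to each of those last N clicked articles together with the sum, max, min and mean of these similarities, plus time-difference features, word-count-difference features, and the similarity between the article and the user"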
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T01:07:47.268035Z",
+ "start_time": "2020-11-18T01:07:47.250449Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# build history-related features\n",
+    "def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1):\n",
+    "    \"\"\"\n",
+    "    Build features based on the user's click history\n",
+    "    :param users_id: user ids\n",
+    "    :param recall_list: for each user, the list of recalled candidate articles\n",
+    "    :param click_hist_df: the users' historical click data\n",
+    "    :param articles_info: article information\n",
+    "    :param articles_emb: article embeddings; item_content_emb, item_w2v_emb or item_youtube_emb can be used\n",
+    "    :param user_emb: user embeddings, i.e. user_youtube_emb; optional, but if it is used, articles_emb must be item_youtube_emb so that the dimensions match\n",
+    "    :param N: number of most recent clicks to use; many users in the testA log have only one historical click, so the default is 1 to avoid missing values\n",
+    "    \"\"\"\n",
+    "    \n",
+    "    # 2-D list that collects the results; converted to a DataFrame at the end\n",
+    "    all_user_feas = []\n",
+    "    for user_id in tqdm(users_id):\n",
+    "        # the user's last N clicks\n",
+    "        hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]\n",
+    "        \n",
+    "        # iterate over the user's recall list\n",
+    "        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):\n",
+    "            # creation time and word count of the recalled article\n",
+    "            a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]\n",
+    "            a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]\n",
+    "            single_user_fea = [user_id, article_id]\n",
+    "            # similarity to the last clicked articles; sum, max, min and mean are computed below\n",
+    "            sim_fea = []\n",
+    "            time_fea = []\n",
+    "            word_fea = []\n",
+    "            # iterate over the user's last N clicked articles\n",
+    "            for hist_item in hist_user_items:\n",
+    "                b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]\n",
+    "                b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]\n",
+    "                \n",
+    "                sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))\n",
+    "                time_fea.append(abs(a_create_time-b_create_time))\n",
+    "                word_fea.append(abs(a_words_count-b_words_count))\n",
+    "                \n",
+    "            single_user_fea.extend(sim_fea)      # similarity features\n",
+    "            single_user_fea.extend(time_fea)    # time-difference features\n",
+    "            single_user_fea.extend(word_fea)    # word-count-difference features\n",
+    "            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])  # similarity statistics\n",
+    "            \n",
+    "            if user_emb:  # if user embeddings are given, add the similarity between the recalled article and the user\n",
+    "                single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))\n",
+    "                \n",
+    "            single_user_fea.extend([score, rank, label])    \n",
+    "            # append the row to the overall table\n",
+    "            all_user_feas.append(single_user_fea)\n",
+    "    \n",
+    "    # column names\n",
+    "    id_cols = ['user_id', 'click_article_id']\n",
+    "    sim_cols = ['sim' + str(i) for i in range(N)]\n",
+    "    time_cols = ['time_diff' + str(i) for i in range(N)]\n",
+    "    word_cols = ['word_diff' + str(i) for i in range(N)]\n",
+    "    sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']\n",
+    "    user_item_sim_cols = ['user_item_sim'] if user_emb else []\n",
+    "    user_score_rank_label = ['score', 'rank', 'label']\n",
+    "    cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label\n",
+    "            \n",
+    "    # convert to a DataFrame\n",
+    "    df = pd.DataFrame( all_user_feas, columns=cols)\n",
+    "    \n",
+    "    return df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T01:08:17.531694Z",
+ "start_time": "2020-11-18T01:08:10.754702Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+ "article_info_df = get_article_info_df()\n",
+    "all_click = pd.concat([click_trn, click_tst])\n",
+ "item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:06:22.709350Z",
+ "start_time": "2020-11-18T01:08:39.923811Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 200000/200000 [50:16<00:00, 66.31it/s] \n",
+ "100%|██████████| 50000/50000 [1:07:21<00:00, 12.37it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# build the features of the recalled articles for the training, validation and test data\n",
+ "trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \\\n",
+ " click_trn_hist, article_info_df, item_content_emb_dict)\n",
+ "\n",
+ "if val_user_item_label_tuples_dict is not None:\n",
+ " val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \\\n",
+ " click_val_hist, article_info_df, item_content_emb_dict)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \\\n",
+ " click_tst_hist, article_info_df, item_content_emb_dict)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:13:58.573422Z",
+ "start_time": "2020-11-18T03:13:40.157228Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# save a copy so that this slow step does not have to be rerun every time\n",
+ "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
+ "\n",
+ "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:14:22.838154Z",
+ "start_time": "2020-11-18T03:14:22.828212Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
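+    "## User and article features\n",
+    "### User features\n",
+    "This is where the feature engineering proper starts: we join the existing features and also build new ones. The existing and constructible features are:\n",
+    "1. Features of the article itself: word count, creation time, and the article embedding (in the articles table)\n",
+    "2. Features of the user's click environment, i.e. the device features (these are in df)\n",
+    "3. Further features that can be built for users and articles:\n",
+    "    * User activity features, built from the number of articles a user clicked and the click times\n",
+    "    * Article popularity features, built from the number of times an article was clicked and the click times\n",
+    "    * User time statistics: statistics (e.g. the mean) over the click times and the creation times of the articles in the user's click history, which reflect the user's preference for article freshness\n",
+    "    * User topic preference: count the topics of the articles the user has clicked, then check whether the current article belongs to a topic the user has already clicked\n",
+    "    * User word-count preference: the mean word count of the articles the user has clicked"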
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:16:37.637495Z",
+ "start_time": "2020-11-14T03:16:37.618229Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "click_tst.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:11.675550Z",
+ "start_time": "2020-11-17T02:09:10.265134Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# load the article features\n",
+    "articles = pd.read_csv(data_path+'articles.csv')\n",
+    "articles = reduce_mem(articles)\n",
+    "\n",
+    "# log data, i.e. all the click data loaded above\n",
+    "all_data = click_trn\n",
+    "if click_val is not None:\n",
+    "    all_data = pd.concat([all_data, click_val])\n",
+    "all_data = pd.concat([all_data, click_tst])\n",
+    "all_data = reduce_mem(all_data)\n",
+    "\n",
+    "# join the article information\n",
+    "all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:17:12.256244Z",
+ "start_time": "2020-11-14T03:17:12.250452Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "all_data.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
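+    "#### Click times and click counts: measuring user activity\n",
+    "If the intervals between a user's clicks are short and the user has clicked many articles, we regard the user as active. User activity can of course be measured in many ways; only one is given here. We write a function that produces a feature measuring user activity, with the following logic:\n",
+    "1. Group by user_id and, for each user, compute the number of clicked articles and the mean interval between consecutive clicks\n",
+    "2. Take the reciprocal of the click count, min-max normalize both it and the mean interval, and add the two; the smaller the value, the more active the user\n",
+    "3. Note that the mean interval between consecutive clicks is undefined for a user with only one click; in that case the feature is given a large number to set such users apart\n",
+    "\n",
+    "In other words, the measure takes the reciprocal of the click count and normalizes it, normalizes the click time differences, and adds the two; the smaller the value, the more clicks there are and the shorter the intervals between them. "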
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:28:55.336058Z",
+ "start_time": "2020-11-17T02:28:55.324332Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def active_level(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build a feature that measures user activity\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    # sort a copy of the selected columns instead of sorting a slice in place\n",
+    "    data = all_data[cols].sort_values(['user_id', 'click_timestamp'])\n",
+    "    user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].\\\n",
+    "                            agg({'click_article_id':np.size, 'click_timestamp': list}).values, columns=['user_id', 'click_size', 'click_timestamp'])\n",
+    "    \n",
+    "    # mean interval between consecutive clicks\n",
+    "    def time_diff_mean(l):\n",
+    "        if len(l) == 1:\n",
+    "            return 1  # a single click has no interval; 1 is a placeholder value\n",
+    "        else:\n",
+    "            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
+    "        \n",
+    "    user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
+    "    \n",
+    "    # reciprocal of the click count\n",
+    "    user_act['click_size'] = 1 / user_act['click_size']\n",
+    "    \n",
+    "    # min-max normalize both quantities\n",
+    "    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())\n",
+    "    user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min())     \n",
+    "    user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']\n",
+    "    \n",
+    "    user_act['user_id'] = user_act['user_id'].astype('int')\n",
+    "    del user_act['click_timestamp']\n",
+    "    \n",
+    "    return user_act"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:30:12.696060Z",
+ "start_time": "2020-11-17T02:29:01.523837Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:28:53.996742Z",
+ "start_time": "2020-11-17T02:09:18.374Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_act_fea.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
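+    "#### Click times and click counts: measuring article popularity\n",
+    "The idea is the same as above: if an article is clicked many times within short intervals, it is popular. The logic is essentially the same, except that we now group by the clicked article:\n",
+    "1. Group by article and, for the users who clicked each article, compute the intervals between clicks\n",
+    "2. Take the reciprocal of the number of users, normalize both it and the time interval, and add them to get the popularity feature; the smaller the value, the more clicks and the shorter the intervals, i.e. the more popular the article\n",
+    "\n",
+    "Of course, this is only one way of judging article popularity; feel free to brainstorm others"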
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:41:26.398567Z",
+ "start_time": "2020-11-17T02:41:26.386668Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def hot_level(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build a feature that measures article popularity\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    # sort a copy of the selected columns instead of sorting a slice in place\n",
+    "    data = all_data[cols].sort_values(['click_article_id', 'click_timestamp'])\n",
+    "    article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].\\\n",
+    "                               agg({'user_id':np.size, 'click_timestamp': list}).values, columns=['click_article_id', 'user_num', 'click_timestamp'])\n",
+    "    \n",
+    "    # mean interval between consecutive clicks on the article\n",
+    "    def time_diff_mean(l):\n",
+    "        if len(l) == 1:\n",
+    "            return 1  # a single click has no interval; 1 is a placeholder value\n",
+    "        else:\n",
+    "            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
+    "        \n",
+    "    article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
+    "    \n",
+    "    # reciprocal of the number of clicking users\n",
+    "    article_hot['user_num'] = 1 / article_hot['user_num']\n",
+    "    \n",
+    "    # min-max normalize both quantities\n",
+    "    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())\n",
+    "    article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min())     \n",
+    "    article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']\n",
+    "    \n",
+    "    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')\n",
+    "    \n",
+    "    del article_hot['click_timestamp']\n",
+    "    \n",
+    "    return article_hot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:41:44.635900Z",
+ "start_time": "2020-11-17T02:41:31.473032Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:19:54.775290Z",
+ "start_time": "2020-11-14T03:19:54.763699Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "article_hot_fea.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
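+    "#### User habits\n",
+    "From the original log table we build a DataFrame similar to the articles table that stores user-specific information, mainly click habits and preference features:\n",
+    "* Device habit: the user's most frequently used device (the mode)\n",
+    "* Time habit: statistics over the times of the articles the user has clicked (ideally the hour would be extracted from the timestamp to see at what time of day the user tends to click); for now the normalized timestamps are used and averaged\n",
+    "* Topic preference: judge from the topics of the clicked articles which topics the user leans toward; ideally this is multi-hot encoded, so let us first see whether that works\n",
+    "* Word-count feature: the word-count habit of the articles the user likes\n",
+    "\n",
+    "All of these are obtained by grouping by user and then computing statistics"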
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User device habits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T04:22:48.877978Z",
+ "start_time": "2020-11-17T04:22:48.872049Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def device_fea(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build the user's device features\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    user_device_info = all_data[cols]\n",
+    "    \n",
+    "    # represent each user's device information by the mode of each column\n",
+    "    user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()\n",
+    "    \n",
+    "    return user_device_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T05:27:10.897473Z",
+ "start_time": "2020-11-17T04:49:33.214865Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# device features (this step takes a long time)\n",
+ "device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']\n",
+ "user_device_info = device_fea(all_data, device_cols)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T04:20:39.765842Z",
+ "start_time": "2020-11-14T04:20:39.747087Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_device_info.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User time habits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:11:50.889905Z",
+ "start_time": "2020-11-17T06:11:50.882653Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def user_time_hob_fea(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build the user's time-habit features\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    # copy so that the assignments below do not write into a slice of all_data\n",
+    "    user_time_hob_info = all_data[cols].copy()\n",
+    "    \n",
+    "    # min-max normalize the timestamps first\n",
+    "    mm = MinMaxScaler()\n",
+    "    user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])\n",
+    "    user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])\n",
+    "\n",
+    "    user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()\n",
+    "    \n",
+    "    user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)\n",
+    "    return user_time_hob_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:31:51.646110Z",
+ "start_time": "2020-11-17T06:31:51.171431Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts']\n",
+ "user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
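+    "#### User topic preference\n",
+    "Here the topics of the articles a user has clicked are collected into a list. Later, in the final merge, a separate feature is built: 1 if the article's topic is in this list, 0 otherwise."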
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:31:56.571088Z",
+ "start_time": "2020-11-17T06:31:56.565304Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def user_cat_hob_fea(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build the user's topic preference\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    user_category_hob_info = all_data[cols]\n",
+    "    # collect each user's clicked category ids into a list\n",
+    "    user_category_hob_info = user_category_hob_info.groupby('user_id').agg(list).reset_index()\n",
+    "    \n",
+    "    user_cat_hob_info = pd.DataFrame()\n",
+    "    user_cat_hob_info['user_id'] = user_category_hob_info['user_id']\n",
+    "    user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']\n",
+    "    \n",
+    "    return user_cat_hob_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:32:55.150800Z",
+ "start_time": "2020-11-17T06:32:00.740046Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_category_hob_cols = ['user_id', 'category_id']\n",
+ "user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User word-count preference"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:12.988460Z",
+ "start_time": "2020-11-17T06:48:12.547000Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
+ "user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### Merging and saving the user features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:18.289591Z",
+ "start_time": "2020-11-17T06:48:17.084408Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# merge all the tables\n",
+ "user_info = pd.merge(user_act_fea, user_device_info, on='user_id')\n",
+ "user_info = user_info.merge(user_time_hob_info, on='user_id')\n",
+ "user_info = user_info.merge(user_cat_hob_info, on='user_id')\n",
+ "user_info = user_info.merge(user_wcou_info, on='user_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:26.907785Z",
+ "start_time": "2020-11-17T06:48:21.457597Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Save the user features so they can be loaded directly later\n",
+ "user_info.to_csv(save_path + 'user_info.csv', index=False) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
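+    "### Loading the user features directly\n",
+    "If the user feature engineering above has already been run, the saved features can be loaded directly."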
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 69,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:15:49.502826Z",
+ "start_time": "2020-11-18T03:15:48.062243Z"
+ }
+ },
+ "outputs": [],
+ "source": [
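+    "# Load the saved user features; cate_list was written to CSV as a string, so parse it back into a list\n",
+    "import ast\n",
+    "user_info = pd.read_csv(save_path + 'user_info.csv', converters={'cate_list': ast.literal_eval})"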
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:15:56.899635Z",
+ "start_time": "2020-11-18T03:15:53.701818Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):\n",
+ " trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
+ " \n",
+ "if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):\n",
+ " tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
+ "\n",
+ "if os.path.exists(save_path + 'val_user_item_feats_df.csv'):\n",
+ " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
+ "else:\n",
+ " val_user_item_feats_df = None"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:02.739197Z",
+ "start_time": "2020-11-18T03:16:01.725028Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Join the user features\n",
+    "# The following is for offline validation\n",
+ "trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id',how='left')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 72,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:06.989877Z",
+ "start_time": "2020-11-18T03:16:06.983327Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',\n",
+ " 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',\n",
+ " 'click_size', 'time_diff_mean', 'active_level', 'click_environment',\n",
+ " 'click_deviceGroup', 'click_os', 'click_country', 'click_region',\n",
+ " 'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',\n",
+ " 'words_hbo'],\n",
+ " dtype='object')"
+ ]
+ },
+ "execution_count": 72,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_user_item_feats_df.columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:13:36.071236Z",
+ "start_time": "2020-11-14T03:13:36.050188Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Loading the article features directly"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 73,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:12.793070Z",
+ "start_time": "2020-11-18T03:16:12.425380Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+ "articles = pd.read_csv(data_path+'articles.csv')\n",
+ "articles = reduce_mem(articles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 74,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:18.118507Z",
+ "start_time": "2020-11-18T03:16:16.344338Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Join the article features\n",
+ "trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ "\n",
+ "tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Whether the recalled article's category is among the user's preferred categories"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 76,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:17:40.251797Z",
+ "start_time": "2020-11-18T03:16:28.130012Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ "tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 77,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:19:30.451200Z",
+ "start_time": "2020-11-18T03:19:30.411225Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Offline validation\n",
+ "del trn_user_item_feats_df['cate_list']\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " del val_user_item_feats_df['cate_list']\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "del tst_user_item_feats_df['cate_list']\n",
+ "\n",
+ "del trn_user_item_feats_df['article_id']\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " del val_user_item_feats_df['article_id']\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "del tst_user_item_feats_df['article_id']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Saving the features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 78,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:20:08.560942Z",
+ "start_time": "2020-11-18T03:19:35.601095Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+    "# Training and validation features\n",
+ "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
+ "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
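+    "# Summary\n",
+    "Feature engineering and data cleaning and transformation are a crucial part of any competition, because **data and features determine the upper bound of machine learning, while algorithms and models only approach that bound**. The quality of the feature engineering therefore often decides the final result. **Feature engineering** further strengthens the expressive power of the data: by constructing new features we can mine more information from it. In this section we first turned the prediction problem into a supervised learning problem by building features and labels, then built a series of features around the user profile and the article profile. To keep positive and negative samples balanced, we also covered negative sampling. This section only offers some ideas for constructing features; learners are encouraged to brainstorm and try other approaches, and discussion is welcome.\n",
+    "\n",
+    "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and a team of members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, to support talent development and to build connections between people, between people and knowledge, between people and companies, and between people and the future. The topic materials for this data-mining learning path will be shared on Tianchi; for details, follow Datawhale:\n",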
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- "tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 77,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:19:30.451200Z",
- "start_time": "2020-11-18T03:19:30.411225Z"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.5"
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130010",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "218px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "outputs": [],
- "source": [
-    "# Offline validation\n",
- "del trn_user_item_feats_df['cate_list']\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " del val_user_item_feats_df['cate_list']\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "del tst_user_item_feats_df['cate_list']\n",
- "\n",
- "del trn_user_item_feats_df['article_id']\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " del val_user_item_feats_df['article_id']\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "del tst_user_item_feats_df['article_id']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Saving the features"
- ]
},
- {
- "cell_type": "code",
- "execution_count": 78,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:20:08.560942Z",
- "start_time": "2020-11-18T03:19:35.601095Z"
- },
- "scrolled": true
- },
- "outputs": [],
- "source": [
-    "# Training and validation features\n",
- "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
- "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
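-    "# Summary\n",
-    "Feature engineering and data cleaning and transformation are a crucial part of any competition, because **data and features determine the upper bound of machine learning, while algorithms and models only approach that bound**. The quality of the feature engineering therefore often decides the final result. **Feature engineering** further strengthens the expressive power of the data: by constructing new features we can mine more information from it. In this section we first turned the prediction problem into a supervised learning problem by building features and labels, then built a series of features around the user profile and the article profile. To keep positive and negative samples balanced, we also covered negative sampling. This section only offers some ideas for constructing features; learners are encouraged to brainstorm and try other approaches, and discussion is welcome.\n",
-    "\n",
-    "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and a team of members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, to support talent development and to build connections between people, between people and knowledge, between people and companies, and between people and the future. The topic materials for this data-mining learning path will be shared on Tianchi; for details, follow Datawhale:\n",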
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130010",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "218px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
- }
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/ch03/ch3.1/jupyter/\350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb" "b/docs/ch03/ch3.1/jupyter/\350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
index 1dae6308d..1567babe2 100644
--- "a/docs/ch03/ch3.1/jupyter/\350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
+++ "b/docs/ch03/ch3.1/jupyter/\350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
@@ -1,664 +1,664 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
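-    "# Understanding the Task\n",
-    "Understanding the task is the starting point for any competition. It shapes the later feature engineering and modeling work and sets the direction of everything that follows. A correct grasp of the idea behind the task and of its business logic helps us build more effective features and models in less time. In any competition, understanding the task is an essential first step. We start from there: first we look at the task overview and the data and work out a rough approach, then we look at the evaluation metric, and finally we summarize some lessons about understanding the task.\n",
-    "\n",
-    "## Task Overview\n",
-    "This competition is a user behavior prediction challenge in a news recommendation setting. With news recommendation in a news app as the background, the goal is to **predict a user's future click behavior, namely the last news article the user clicks, from the user's historical browsing and click data**. The task is designed to introduce some of the business background of recommender systems and to solve a practical problem.  \n",
-    "\n",
-    "## Data Overview\n",
-    "The data comes from user interactions on a news app platform. It covers 300,000 users, nearly 3 million clicks, and more than 360,000 distinct news articles, and every article has a corresponding embedding vector. To keep the competition fair, the click logs of 200,000 users are used as the training set, those of 50,000 users as test set A, and those of another 50,000 users as test set B. For the exact tables and fields, see the task description. Next we discuss how to make sense of this data so that the following work can proceed effectively.  \n",
-    "## Understanding the Evaluation Metric\n",
-    "To understand the metric, we need to look at it together with the submission file. According to sample.submit.csv, for each user we submit five recommended articles, ordered by click probability from highest to lowest. Each user's true last click is a single article, so we check whether any of the five recommendations hits it. For user1, for example, the submission looks like:\n",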
- ">user1, article1, article2, article3, article4, article5.\n",
- "\n",
-    "The metric is defined as:\n",
- "$$\n",
- "score(user) = \\sum_{k=1}^5 \\frac{s(user, k)}{k}\n",
- "$$\n",
- "\n",
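-    "If article1 is the article the user actually clicked, i.e. article1 is a hit, then s(user1,1)=1 and s(user1,k)=0 for k=2..5, so score(user1)=1. If article2 is the clicked article, then s(user1,2)=1 and s(user1,k)=0 for k=1,3,4,5, so score(user1)=1/2. In other words, score(user) is the reciprocal of the rank of the hit. If none of the five is a hit, score(user1)=0. This is reasonable: we want the hit to rank as high as possible, and that is exactly when the score is higher.\n",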
- "\n",
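-    "## Analyzing the Task\n",
-    "From the overview, the goal is clear: predict the last news article a user clicks from the user's historical browsing and click data. Seen this way, the competition differs from the usual structured-data competitions in two main ways:\n",
-    " \n",
-    "- The target: we predict the last clicked news article, i.e. we recommend news articles to the user, rather than predicting a number or a class as before\n",
-    "- The data: it is not the familiar features-plus-labels table but user click logs taken from a real business scenario\n",
-    "\n",
-    "So the way forward is to **turn the prediction problem into a supervised learning problem (features + labels), after which we can model it with ML or DL**. A few questions follow naturally: How do we turn it into a supervised learning problem? What kind of supervised learning problem? Which features can we use? Which models can we try? Faced with recommending from hundreds of thousands of articles, what strategies do we have? \n",
-    "\n",
-    "These questions will not all be answered the moment we read the task, but once they are posed we can work on them. Take the second one: what kind of supervised learning problem? Since we predict the last article a user clicks, picking one out of 360,000 articles first suggests a multi-class problem (choose 1 of 360,000 classes), but a classification problem that large is hard to handle. Can we transform it? If we can predict the probability that a user's last click lands on a given article, the problem is solved indirectly: the article with the highest probability is the one the user is most likely to click last. The original problem thus becomes a click-through-rate prediction problem, (user, article) --> click probability (soft classification), which is a familiar supervised classification problem. This gives a rough direction for model choice, for example the simplest option, logistic regression. \n",
-    "We now have a rough plan: cast the task as a classification problem whose label is whether the user clicks a given article and whose features describe the user and the article, then train a classifier that predicts the probability that a user's last click is a given article. More questions follow: How do we build the supervised learning problem? How do we construct the training and test sets? Which features can we use? Which models can we try? With 360,000 articles and more than 200,000 users, how do we shrink the problem? How do we make the final prediction? "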
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Baseline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:46:49.678700Z",
- "start_time": "2020-11-16T07:46:49.673336Z"
- }
- },
- "outputs": [],
- "source": [
- "# import packages\n",
- "import time, math, os\n",
- "from tqdm import tqdm\n",
- "import gc\n",
- "import pickle\n",
- "import random\n",
- "from datetime import datetime\n",
- "from operator import itemgetter\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import warnings\n",
- "from collections import defaultdict\n",
- "import collections\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:48:34.240098Z",
- "start_time": "2020-11-16T07:48:34.236370Z"
- }
- },
- "outputs": [],
- "source": [
- "# data_path = './data_raw/'\n",
-    "data_path = '/home/admin/jupyter/data/' # path on the Tianchi platform\n",
-    "save_path = '/home/admin/jupyter/temp_results/'  # path on the Tianchi platform"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Function to reduce DataFrame memory usage"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# A standard helper that reduces memory usage by downcasting numeric columns\n",
- "def reduce_mem(df):\n",
- " starttime = time.time()\n",
- " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
- " start_mem = df.memory_usage().sum() / 1024**2\n",
- " for col in df.columns:\n",
- " col_type = df[col].dtypes\n",
- " if col_type in numerics:\n",
- " c_min = df[col].min()\n",
- " c_max = df[col].max()\n",
- " if pd.isnull(c_min) or pd.isnull(c_max):\n",
- " continue\n",
- " if str(col_type)[:3] == 'int':\n",
- " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
- " df[col] = df[col].astype(np.int8)\n",
- " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
- " df[col] = df[col].astype(np.int16)\n",
- " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
- " df[col] = df[col].astype(np.int32)\n",
- " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
- " df[col] = df[col].astype(np.int64)\n",
- " else:\n",
- " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
- " df[col] = df[col].astype(np.float16)\n",
- " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
- " df[col] = df[col].astype(np.float32)\n",
- " else:\n",
- " df[col] = df[col].astype(np.float64)\n",
- " end_mem = df.memory_usage().sum() / 1024**2\n",
- " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
- " 100*(start_mem-end_mem)/start_mem,\n",
- " (time.time()-starttime)/60))\n",
- " return df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Loading a sample or the full data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:48:50.619963Z",
- "start_time": "2020-11-16T07:48:50.611667Z"
- }
- },
- "outputs": [],
- "source": [
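-    "# Debug mode: sample part of the training set to debug the code\n",
-    "def get_all_click_sample(data_path, sample_nums=10000):\n",
-    "    \"\"\"\n",
-    "        Sample part of the training set for debugging\n",
-    "        data_path: path where the raw data is stored\n",
-    "        sample_nums: number of users to sample (sampling by user keeps memory usage manageable)\n",
-    "    \"\"\"\n",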
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " all_user_ids = all_click.user_id.unique()\n",
- "\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
- " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click\n",
- "\n",
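-    "# Load the click data in online or offline mode: to produce an online submission, merge the test-set clicks into the full data\n",
-    "# To validate models or features offline, the training set alone is enough\n",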
- "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
- " if offline:\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " else:\n",
- " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "\n",
-    "        all_click = pd.concat([trn_click, tst_click])\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# Full data (offline=False also includes the test set A clicks)\n",
- "all_click_df = get_all_click_df(data_path, offline=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Building the user - article - click time dictionary"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:56:39.800240Z",
- "start_time": "2020-11-16T07:56:39.793541Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build each user's click sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
- "def get_user_item_time(click_df):\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " \n",
- " def make_item_time_pair(df):\n",
- " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
- " \n",
-    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'item_time_list'})\n",
- " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
- " \n",
- " return user_item_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Getting the top-k most clicked articles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# Get the k most clicked articles\n",
- "def get_item_topk_click(click_df, k):\n",
- " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
- " return topk_click"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Item similarity computation for ItemCF"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:51:07.577037Z",
- "start_time": "2020-11-16T07:51:07.568098Z"
- }
- },
- "outputs": [],
- "source": [
- "def itemcf_sim(df):\n",
- " \"\"\"\n",
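-    "        Compute the article-to-article similarity matrix\n",
-    "        :param df: click log DataFrame\n",
-    "        return : article-to-article similarity matrix\n",
-    "        Approach: item-based collaborative filtering (for details, see the previous team-learning course on recommender system basics); association-rule recall strategies are added in the multi-channel recall part\n",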
- " \"\"\"\n",
- " \n",
- " user_item_time_dict = get_user_item_time(df)\n",
- " \n",
-    "    # Compute item similarity\n",
- " i2i_sim = {}\n",
- " item_cnt = defaultdict(int)\n",
- " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
-    "        # Time could be taken into account when refining item-based collaborative filtering\n",
- " for i, i_click_time in item_time_list:\n",
- " item_cnt[i] += 1\n",
- " i2i_sim.setdefault(i, {})\n",
- " for j, j_click_time in item_time_list:\n",
- " if(i == j):\n",
- " continue\n",
- " i2i_sim[i].setdefault(j, 0)\n",
- " \n",
- " i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)\n",
- " \n",
- " i2i_sim_ = i2i_sim.copy()\n",
- " for i, related_items in i2i_sim.items():\n",
- " for j, wij in related_items.items():\n",
- " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
- " \n",
-    "    # Save the similarity matrix to disk\n",
- " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
- " \n",
- " return i2i_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:53:10.038470Z",
- "start_time": "2020-11-16T07:51:11.281176Z"
- }
- },
- "outputs": [
+ "cells": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:23<00:00, 10802.38it/s]\n"
- ]
- }
- ],
- "source": [
- "i2i_sim = itemcf_sim(all_click_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Article recommendation with ItemCF"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T08:03:18.383215Z",
- "start_time": "2020-11-16T08:03:18.373432Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Item-based (i2i) recall\n",
- "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):\n",
- " \"\"\"\n",
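-    "        Recall based on article collaborative filtering\n",
-    "        :param user_id: user id\n",
-    "        :param user_item_time_dict: dict, each user's click sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
-    "        :param i2i_sim: dict, article similarity matrix\n",
-    "        :param sim_item_topk: int, number of most similar articles to take for each clicked article\n",
-    "        :param recall_item_num: int, number of articles to recall in the end\n",
-    "        :param item_topk_click: list, most clicked articles, used to pad the recall list\n",
-    "        return: recalled articles as a list of (item, score) pairs sorted by score\n",
-    "        Note: item-based collaborative filtering (for details, see the previous team-learning course on recommender system basics); association-rule recall strategies are added in the multi-channel recall part\n",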
- " \"\"\"\n",
- " \n",
-    "    # Articles the user has interacted with\n",
-    "    user_hist_items = user_item_time_dict[user_id]\n",
-    "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
- " \n",
- " item_rank = {}\n",
- " for loc, (i, click_time) in enumerate(user_hist_items):\n",
- " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
- " if j in user_hist_items_:\n",
- " continue\n",
- " \n",
- " item_rank.setdefault(j, 0)\n",
- " item_rank[j] += wij\n",
- " \n",
-    "    # If fewer than recall_item_num articles were recalled, pad with popular articles\n",
-    "    if len(item_rank) < recall_item_num:\n",
-    "        for i, item in enumerate(item_topk_click):\n",
-    "            if item in item_rank: # a padded item must not already be in the recall list\n",
-    "                continue\n",
-    "            item_rank[item] = - i - 100 # any negative score works\n",
- " if len(item_rank) == recall_item_num:\n",
- " break\n",
- " \n",
- " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
- " \n",
- " return item_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Recommending articles to each user with item-based collaborative filtering"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:15:01.109798Z",
- "start_time": "2020-11-16T08:11:07.233787Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
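+                "# Understanding the Task\n",
+                "Understanding the task is the starting point for any competition. It shapes the later feature engineering and modeling work and sets the direction of everything that follows. A correct grasp of the idea behind the task and of its business logic helps us build more effective features and models in less time. In any competition, understanding the task is an essential first step. We start from there: first we look at the task overview and the data and work out a rough approach, then we look at the evaluation metric, and finally we summarize some lessons about understanding the task.\n",
+                "\n",
+                "## Task Overview\n",
+                "This competition is a user behavior prediction challenge in a news recommendation setting. With news recommendation in a news app as the background, the goal is to **predict a user's future click behavior, namely the last news article the user clicks, from the user's historical browsing and click data**. The task is designed to introduce some of the business background of recommender systems and to solve a practical problem.  \n",
+                "\n",
+                "## Data Overview\n",
+                "The data comes from user interactions on a news app platform. It covers 300,000 users, nearly 3 million clicks, and more than 360,000 distinct news articles, and every article has a corresponding embedding vector. To keep the competition fair, the click logs of 200,000 users are used as the training set, those of 50,000 users as test set A, and those of another 50,000 users as test set B. For the exact tables and fields, see the task description. Next we discuss how to make sense of this data so that the following work can proceed effectively.  \n",
+                "## Understanding the Evaluation Metric\n",
+                "To understand the metric, we need to look at it together with the submission file. According to sample.submit.csv, for each user we submit five recommended articles, ordered by click probability from highest to lowest. Each user's true last click is a single article, so we check whether any of the five recommendations hits it. For user1, for example, the submission looks like:\n",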
+ ">user1, article1, article2, article3, article4, article5.\n",
+ "\n",
+                "The metric is defined as:\n",
+ "$$\n",
+ "score(user) = \\sum_{k=1}^5 \\frac{s(user, k)}{k}\n",
+ "$$\n",
+ "\n",
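+                "If article1 is the article the user actually clicked, i.e. article1 is a hit, then s(user1,1)=1 and s(user1,k)=0 for k=2..5, so score(user1)=1. If article2 is the clicked article, then s(user1,2)=1 and s(user1,k)=0 for k=1,3,4,5, so score(user1)=1/2. In other words, score(user) is the reciprocal of the rank of the hit. If none of the five is a hit, score(user1)=0. This is reasonable: we want the hit to rank as high as possible, and that is exactly when the score is higher.\n",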
+ "\n",
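+                "## Analyzing the Task\n",
+                "From the overview, the goal is clear: predict the last news article a user clicks from the user's historical browsing and click data. Seen this way, the competition differs from the usual structured-data competitions in two main ways:\n",
+                " \n",
+                "- The target: we predict the last clicked news article, i.e. we recommend news articles to the user, rather than predicting a number or a class as before\n",
+                "- The data: it is not the familiar features-plus-labels table but user click logs taken from a real business scenario\n",
+                "\n",
+                "So the way forward is to **turn the prediction problem into a supervised learning problem (features + labels), after which we can model it with ML or DL**. A few questions follow naturally: How do we turn it into a supervised learning problem? What kind of supervised learning problem? Which features can we use? Which models can we try? Faced with recommending from hundreds of thousands of articles, what strategies do we have? \n",
+                "\n",
+                "These questions will not all be answered the moment we read the task, but once they are posed we can work on them. Take the second one: what kind of supervised learning problem? Since we predict the last article a user clicks, picking one out of 360,000 articles first suggests a multi-class problem (choose 1 of 360,000 classes), but a classification problem that large is hard to handle. Can we transform it? If we can predict the probability that a user's last click lands on a given article, the problem is solved indirectly: the article with the highest probability is the one the user is most likely to click last. The original problem thus becomes a click-through-rate prediction problem, (user, article) --> click probability (soft classification), which is a familiar supervised classification problem. This gives a rough direction for model choice, for example the simplest option, logistic regression. \n",
+                "We now have a rough plan: cast the task as a classification problem whose label is whether the user clicks a given article and whose features describe the user and the article, then train a classifier that predicts the probability that a user's last click is a given article. More questions follow: How do we build the supervised learning problem? How do we construct the training and test sets? Which features can we use? Which models can we try? With 360,000 articles and more than 200,000 users, how do we shrink the problem? How do we make the final prediction? "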
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [43:19<00:00, 96.18it/s] \n"
- ]
- }
- ],
- "source": [
-    "# Recall results for every user\n",
-    "user_recall_items_dict = collections.defaultdict(dict)\n",
-    "\n",
-    "# Build the user - article - click time dictionary\n",
-    "user_item_time_dict = get_user_item_time(all_click_df)\n",
-    "\n",
-    "# Load the article similarity matrix\n",
-    "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
-    "\n",
-    "# Number of similar articles to consider per clicked article\n",
-    "sim_item_topk = 10\n",
-    "\n",
-    "# Number of articles to recall\n",
-    "recall_item_num = 10\n",
-    "\n",
-    "# Popular articles used for padding\n",
- "item_topk_click = get_item_topk_click(all_click_df, k=50)\n",
- "\n",
- "for user in tqdm(all_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, \n",
- " sim_item_topk, recall_item_num, item_topk_click)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Converting the recall dictionary to a DataFrame"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:16:36.647466Z",
- "start_time": "2020-11-16T10:16:24.791219Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Baseline"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:04<00:00, 53319.08it/s]\n"
- ]
- }
- ],
- "source": [
-    "# Convert the dictionary into a DataFrame\n",
- "user_item_score_list = []\n",
- "\n",
- "for user, items in tqdm(user_recall_items_dict.items()):\n",
- " for item, score in items:\n",
- " user_item_score_list.append([user, item, score])\n",
- "\n",
- "recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Generating the submission file"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:16:46.268341Z",
- "start_time": "2020-11-16T10:16:46.259293Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Generate the submission file\n",
- "def submit(recall_df, topk=5, model_name=None):\n",
- " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
- " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
-    "    # Check that every user has at least topk articles\n",
- " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
- " assert tmp.min() >= topk\n",
- " \n",
- " del recall_df['pred_score']\n",
- " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
- " \n",
- " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
-    "    # Name the columns according to the submission format\n",
- " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
- " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
- " \n",
- " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
- " submit.to_csv(save_name, index=False, header=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:17:42.254328Z",
- "start_time": "2020-11-16T10:17:32.211862Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 导包"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:46:49.678700Z",
+ "start_time": "2020-11-16T07:46:49.673336Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# import packages\n",
+ "import time, math, os\n",
+ "from tqdm import tqdm\n",
+ "import gc\n",
+ "import pickle\n",
+ "import random\n",
+ "from datetime import datetime\n",
+ "from operator import itemgetter\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import warnings\n",
+ "from collections import defaultdict\n",
+ "import collections\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:48:34.240098Z",
+ "start_time": "2020-11-16T07:48:34.236370Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# data_path = './data_raw/'\n",
+ "data_path = '/home/admin/jupyter/data/' # 天池平台路径\n",
+ "save_path = '/home/admin/jupyter/temp_results/' # 天池平台路径"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## df节省内存函数"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 节约内存的一个标配函数\n",
+ "def reduce_mem(df):\n",
+ " starttime = time.time()\n",
+ " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
+ " start_mem = df.memory_usage().sum() / 1024**2\n",
+ " for col in df.columns:\n",
+ " col_type = df[col].dtypes\n",
+ " if col_type in numerics:\n",
+ " c_min = df[col].min()\n",
+ " c_max = df[col].max()\n",
+ " if pd.isnull(c_min) or pd.isnull(c_max):\n",
+ " continue\n",
+ " if str(col_type)[:3] == 'int':\n",
+ " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
+ " df[col] = df[col].astype(np.int8)\n",
+ " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
+ " df[col] = df[col].astype(np.int16)\n",
+ " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
+ " df[col] = df[col].astype(np.int32)\n",
+ " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
+ " df[col] = df[col].astype(np.int64)\n",
+ " else:\n",
+ " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
+ " df[col] = df[col].astype(np.float16)\n",
+ " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
+ " df[col] = df[col].astype(np.float32)\n",
+ " else:\n",
+ " df[col] = df[col].astype(np.float64)\n",
+ " end_mem = df.memory_usage().sum() / 1024**2\n",
+ " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
+ " 100*(start_mem-end_mem)/start_mem,\n",
+ " (time.time()-starttime)/60))\n",
+ " return df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 读取采样或全量数据"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:48:50.619963Z",
+ "start_time": "2020-11-16T07:48:50.611667Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# debug模式:从训练集中划出一部分数据来调试代码\n",
+ "def get_all_click_sample(data_path, sample_nums=10000):\n",
+ " \"\"\"\n",
+ " 训练集中采样一部分数据调试\n",
+ " data_path: 原数据的存储路径\n",
+ " sample_nums: 采样数目(这里由于机器的内存限制,可以采样用户做)\n",
+ " \"\"\"\n",
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " all_user_ids = all_click.user_id.unique()\n",
+ "\n",
+ " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
+ " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click\n",
+ "\n",
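+    "# Read the click data in one of two modes, online or offline. For an online submission, the test-set clicks should be merged into the full data.\n",
+    "# For offline validation of the model or of features, the training set alone is enough.\n",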
+ "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
+ " if offline:\n",
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " else:\n",
+ " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "\n",
+    "        all_click = pd.concat([trn_click, tst_click])\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# 全量训练集\n",
+ "all_click_df = get_all_click_df(data_path, offline=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 获取 用户 - 文章 - 点击时间字典"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:56:39.800240Z",
+ "start_time": "2020-11-16T07:56:39.793541Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...}\n",
+ "def get_user_item_time(click_df):\n",
+ " \n",
+ " click_df = click_df.sort_values('click_timestamp')\n",
+ " \n",
+ " def make_item_time_pair(df):\n",
+ " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
+ " \n",
+    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
+ " .reset_index().rename(columns={0: 'item_time_list'})\n",
+ " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
+ " \n",
+ " return user_item_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 获取点击最多的topk个文章"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "# Get the k most-clicked articles\n",
+ "def get_item_topk_click(click_df, k):\n",
+ " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
+ " return topk_click"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## itemcf的物品相似度计算"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:51:07.577037Z",
+ "start_time": "2020-11-16T07:51:07.568098Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def itemcf_sim(df):\n",
+ " \"\"\"\n",
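+    "        Compute the article-to-article similarity matrix.\n",
+    "        :param df: click log DataFrame\n",
+    "        return : article-to-article similarity matrix (dict of dicts)\n",
+    "        Approach: item-based collaborative filtering (see the previous team-study session on recommender system basics for details); an association-rule recall strategy will be added in the multi-channel recall part\n",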
+ " \"\"\"\n",
+ " \n",
+ " user_item_time_dict = get_user_item_time(df)\n",
+ " \n",
+ " # 计算物品相似度\n",
+ " i2i_sim = {}\n",
+ " item_cnt = defaultdict(int)\n",
+ " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
+ " # 在基于商品的协同过滤优化的时候可以考虑时间因素\n",
+ " for i, i_click_time in item_time_list:\n",
+ " item_cnt[i] += 1\n",
+ " i2i_sim.setdefault(i, {})\n",
+ " for j, j_click_time in item_time_list:\n",
+ " if(i == j):\n",
+ " continue\n",
+ " i2i_sim[i].setdefault(j, 0)\n",
+ " \n",
+ " i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)\n",
+ " \n",
+ " i2i_sim_ = i2i_sim.copy()\n",
+ " for i, related_items in i2i_sim.items():\n",
+ " for j, wij in related_items.items():\n",
+ " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
+ " \n",
+ " # 将得到的相似性矩阵保存到本地\n",
+ " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
+ " \n",
+ " return i2i_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:53:10.038470Z",
+ "start_time": "2020-11-16T07:51:11.281176Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:23<00:00, 10802.38it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "i2i_sim = itemcf_sim(all_click_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## itemcf 的文章推荐"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T08:03:18.383215Z",
+ "start_time": "2020-11-16T08:03:18.373432Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# 基于商品的召回i2i\n",
+ "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):\n",
+ " \"\"\"\n",
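+    "        Recall based on article collaborative filtering.\n",
+    "        :param user_id: user id\n",
+    "        :param user_item_time_dict: dict, each user's clicked-article sequence ordered by click time {user1: [(item1, time1), (item2, time2)..]...}\n",
+    "        :param i2i_sim: dict, article similarity matrix\n",
+    "        :param sim_item_topk: int, number of most similar articles taken for each clicked article\n",
+    "        :param recall_item_num: int, number of articles finally recalled\n",
+    "        :param item_topk_click: list, the most-clicked articles, used to pad the recall list\n",
+    "        return: recalled articles as a list [(item1, score1), (item2, score2)...]\n",
+    "        Note: item-based collaborative filtering (see the previous team-study session on recommender system basics for details); an association-rule recall strategy will be added in the multi-channel recall part\n",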
+ " \"\"\"\n",
+ " \n",
+ " # 获取用户历史交互的文章\n",
+ " user_hist_items = user_item_time_dict[user_id]\n",
+    "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
+ " \n",
+ " item_rank = {}\n",
+ " for loc, (i, click_time) in enumerate(user_hist_items):\n",
+ " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
+ " if j in user_hist_items_:\n",
+ " continue\n",
+ " \n",
+ " item_rank.setdefault(j, 0)\n",
+ " item_rank[j] += wij\n",
+ " \n",
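+    "    # Fewer than recall_item_num candidates: pad with popular articles\n",
+    "    if len(item_rank) < recall_item_num:\n",
+    "        for i, item in enumerate(item_topk_click):\n",
+    "            if item in item_rank: # a padding item must not already be among the candidates\n",
+    "                continue\n",
+    "            item_rank[item] = - i - 100 # any negative score works; it keeps padding items ranked below real candidates\n",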
+ " if len(item_rank) == recall_item_num:\n",
+ " break\n",
+ " \n",
+ " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
+ " \n",
+ " return item_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 给每个用户根据物品的协同过滤推荐文章"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:15:01.109798Z",
+ "start_time": "2020-11-16T08:11:07.233787Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [43:19<00:00, 96.18it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+ "# 定义\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "\n",
+ "# 获取 用户 - 文章 - 点击时间的字典\n",
+ "user_item_time_dict = get_user_item_time(all_click_df)\n",
+ "\n",
+    "# Load the article similarity matrix\n",
+ "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
+ "\n",
+ "# 相似文章的数量\n",
+ "sim_item_topk = 10\n",
+ "\n",
+ "# 召回文章数量\n",
+ "recall_item_num = 10\n",
+ "\n",
+    "# Popular articles used for padding\n",
+ "item_topk_click = get_item_topk_click(all_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(all_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, \n",
+ " sim_item_topk, recall_item_num, item_topk_click)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 召回字典转换成df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:16:36.647466Z",
+ "start_time": "2020-11-16T10:16:24.791219Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:04<00:00, 53319.08it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# 将字典的形式转换成df\n",
+ "user_item_score_list = []\n",
+ "\n",
+ "for user, items in tqdm(user_recall_items_dict.items()):\n",
+ " for item, score in items:\n",
+ " user_item_score_list.append([user, item, score])\n",
+ "\n",
+ "recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## 生成提交文件"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:16:46.268341Z",
+ "start_time": "2020-11-16T10:16:46.259293Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# 生成提交文件\n",
+ "def submit(recall_df, topk=5, model_name=None):\n",
+ " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
+ " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+ " # 判断是不是每个用户都有5篇文章及以上\n",
+ " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
+ " assert tmp.min() >= topk\n",
+ " \n",
+ " del recall_df['pred_score']\n",
+ " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
+ " \n",
+ " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
+ " # 按照提交格式定义列名\n",
+ " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
+ " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
+ " \n",
+ " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
+ " submit.to_csv(save_name, index=False, header=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:17:42.254328Z",
+ "start_time": "2020-11-16T10:17:32.211862Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# 获取测试集\n",
+ "tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "tst_users = tst_click['user_id'].unique()\n",
+ "\n",
+ "# 从所有的召回数据中将测试集中的用户选出来\n",
+ "tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]\n",
+ "\n",
+ "# 生成提交文件\n",
+ "submit(tst_recall, topk=5, model_name='itemcf_baseline')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
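+    "# Summary\n",
+    "This section covered the competition overview, the data, and the evaluation method, and gave an overall analysis of how to approach the task. It is a warm-up that should help learners get into the problem and lays a foundation for the following sections. Finally, we gave a simple baseline for the competition so that learners can first see the overall workflow of a news recommendation competition; the following sections describe each step of that workflow in detail.\n",
+    "\n",
+    "Today's material is fairly light. Below are some lessons on understanding a competition task:\n",
+    "\n",
+    "* What exactly are we understanding when we study the task?  \n",
+    "\n",
+    ">**Understand the task**: go through the problem intuitively and work out what the goal is and what exactly we are asked to do. **This is very important.**\n",
+    ">\n",
+    ">**Understand the data**: get a first picture of the competition data: which fields are relevant to the task, their types, and how the data sets relate to each other. Roughly sort out which data will be most useful for solving the problem, which helps the later data analysis and feature engineering.\n",
+    ">\n",
+    ">**Understand the evaluation metric**: the metric is the standard by which our methods and results are judged. Only with a correct understanding of it can we train better models and make better predictions. In addition, online validation is often limited in time and number of submissions, **so building a sound local validation set and validation metric is a key step in a competition and can save a lot of time**. Different metrics are sensitive to different kinds of error for the same predictions, so the choice of metric affects what later predictions should focus on.\n",
+    "\n",
+    "* What should we do once we understand the task?\n",
+    "\n",
+    "  >With some understanding of the task, and once the type and nature of the problem and the data are clear, we can sketch an overall approach and framework for solving it.\n",
+    "  >\n",
+    "  >We should at least have some analysis in hand, for example **where the difficulties and the key points of the task probably lie, and where better features might be mined**.\n",
+    "  >\n",
+    "  >We should also consider which offline validation scheme is more stable, and **roughly which methods could address overfitting or other problems if they appear**.\n",
+    "\n",
+    "  This analysis is done at a high level; it helps clarify the overall line of attack and the direction of later analysis.\n",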
+ "\n",
+ "**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
- "# 获取测试集\n",
- "tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "tst_users = tst_click['user_id'].unique()\n",
- "\n",
- "# 从所有的召回数据中将测试集中的用户选出来\n",
- "tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]\n",
- "\n",
- "# 生成提交文件\n",
- "submit(tst_recall, topk=5, model_name='itemcf_baseline')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 总结\n",
- "本节内容主要包括赛题简介,数据概况,评价方式以及对该赛题进行了一个总体上的思路分析,作为竞赛前的预热,旨在帮助学习者们能够更好切入该赛题,为后面的学习内容打下一个良好的基础。最后我们给出了关于本赛题的一个简易Baseline, 帮助学习者们先了解一下新闻推荐比赛的一个整理流程, 接下来我们就对于流程中的每个步骤进行详细的介绍。\n",
- "\n",
- "今天的学习比较简单,下面整理一下关于赛题理解的一些经验:\n",
- "\n",
- "* 赛题理解究竟是在理解什么? \n",
- "\n",
- ">**理解赛题**:从直观上对问题进行梳理, 分析问题的目标,到底要让做什么事情, **这个非常重要**\n",
- ">\n",
- ">**理解数据**:对赛题数据有一个初步了解,知道和任务相关的数据字段和数据字段的类型, 数据之间的内在关联等,大体梳理一下哪些数据会对我们解决问题非常有用,方便后面我们的数据分析和特征工程。\n",
- ">\n",
- ">**理解评估指标**:评估指标是检验我们提出的方法,我们给出结果好坏的标准,只有正确的理解了评估指标,我们才能进行更好的训练模型,更好的进行预测。此外,很多情况下,线上验证是有一定的时间和次数限制的,**所以在比赛中构建一个合理的本地的验证集和验证的评价指标是很关键的步骤,能有效的节省很多时间**。 不同的指标对于同样的预测结果是具有误差敏感的差异性的所以不同的评价指标会影响后续一些预测的侧重点。\n",
- "\n",
- "* 有了赛题理解之后,我们该做什么?\n",
- "\n",
- " >在对于赛题有了一定的了解后,分析清楚了问题的类型性质和对于数据理解 的这一基础上,我们可以梳理一个解决赛题的一个大题思路和框架\n",
- " >\n",
- " >我们至少要有一些相应的理解分析,比如**这题的难点可能在哪里,关键点可能在哪里,哪些地方可以挖掘更好的特征**.\n",
- " >\n",
- " >用什么样得线下验证方式更为稳定,**出现了过拟合或者其他问题,估摸可以用什么方法去解决这些问题**\n",
- "\n",
- " 这时是在一个宏观的大体下分析的,有助于摸清整个题的思路脉络,以及后续的分析方向\n",
- "\n",
- "**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.3"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130006",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "170px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
},
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.3"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130006",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "170px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git a/docs/ch03/ch3.1/markdown/ch3.1.1.md b/docs/ch03/ch3.1/markdown/ch3.1.1.md
index 645152157..5c3930fe0 100644
--- a/docs/ch03/ch3.1/markdown/ch3.1.1.md
+++ b/docs/ch03/ch3.1/markdown/ch3.1.1.md
@@ -377,7 +377,7 @@ submit(tst_recall, topk=5, model_name='itemcf_baseline')
**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git a/docs/ch03/ch3.1/markdown/ch3.1.2.md b/docs/ch03/ch3.1/markdown/ch3.1.2.md
index 173d95002..5584973fa 100644
--- a/docs/ch03/ch3.1/markdown/ch3.1.2.md
+++ b/docs/ch03/ch3.1/markdown/ch3.1.2.md
@@ -66,7 +66,7 @@ trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
```
-![image-20201119112706647](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112706647.png)
+![image-20201119112706647](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112706647.png)
**train_click_log.csv文件数据中每个字段的含义**
@@ -86,7 +86,7 @@ trn_click.head()
trn_click.info()
```
-![image-20201119112622939](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112622939.png)
+![image-20201119112622939](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112622939.png)
@@ -94,7 +94,7 @@ trn_click.info()
trn_click.describe()
```
-![image-20201119112649376](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112649376.png)
+![image-20201119112649376](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112649376.png)
```python
@@ -133,7 +133,7 @@ plt.tight_layout()
plt.show()
```
-![在这里插入图片描述](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/20201118000820300.png)
+![在这里插入图片描述](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/20201118000820300.png)
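 **The click time click_timestamp is distributed fairly evenly, so no special handling is needed. Because the timestamps have 13 digits, they are later converted to 10 digits for easier computation.**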
@@ -149,14 +149,14 @@ tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])
tst_click.head()
```
-![image-20201119112952261](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112952261.png)
+![image-20201119112952261](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112952261.png)
```python
tst_click.describe()
```
-![image-20201119113015529](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113015529.png)
+![image-20201119113015529](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113015529.png)
**我们可以看出训练集和测试集的用户是完全不一样的**
@@ -187,14 +187,14 @@ tst_click.groupby('user_id')['click_article_id'].count().min() # 注意测试集
item_df.head().append(item_df.tail())
```
-![image-20201119113118388](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113118388.png)
+![image-20201119113118388](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113118388.png)
```python
item_df['words_count'].value_counts()
```
-![image-20201119113147240](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113147240.png)
+![image-20201119113147240](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113147240.png)
```python
@@ -219,7 +219,7 @@ item_df.shape # 364047篇文章
item_emb_df.head()
```
-![image-20201119113253455](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113253455.png)
+![image-20201119113253455](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113253455.png)
```python
item_emb_df.shape
@@ -245,21 +245,21 @@ user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['cl
user_click_count[:10]
```
-![image-20201119113334727](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113334727.png)
+![image-20201119113334727](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113334727.png)
```python
user_click_count[user_click_count['count']>7]
```
-![image-20201119113351807](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113351807.png)
+![image-20201119113351807](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113351807.png)
```python
user_click_count['count'].unique()
```
-![image-20201119113429769](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113429769.png)
+![image-20201119113429769](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113429769.png)
```python
@@ -267,7 +267,7 @@ user_click_count['count'].unique()
user_click_count.loc[:,'count'].value_counts()
```
-![image-20201119113414785](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113414785.png)
+![image-20201119113414785](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113414785.png)
**可以看出:有1605541(约占99.2%)的用户未重复阅读过文章,仅有极少数用户重复点击过某篇文章。 这个也可以单独制作成特征**
@@ -301,15 +301,15 @@ for _, user_df in sample_users.groupby('user_id'):
plot_envs(user_df, cols, 2, 3)
```
-![image-20201119113624424](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113624424.png)
+![image-20201119113624424](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113624424.png)
-![image-20201119113637746](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113637746.png)
+![image-20201119113637746](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113637746.png)
-![image-20201119113652132](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113652132.png)
+![image-20201119113652132](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113652132.png)
-![image-20201119113702034](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113702034.png)
+![image-20201119113702034](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113702034.png)
-![image-20201119113714135](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113714135.png)
+![image-20201119113714135](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113714135.png)
 **Most users click in a fairly fixed environment. Idea: statistical features of these environments can be used to represent attributes of the user.**
@@ -322,7 +322,7 @@ plt.plot(user_click_item_count)
```
-![image-20201119113759490](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113759490.png)
+![image-20201119113759490](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113759490.png)
**可以根据用户的点击文章次数看出用户的活跃度**
@@ -332,7 +332,7 @@ plt.plot(user_click_item_count)
plt.plot(user_click_item_count[:50])
```
-![image-20201119113825586](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113825586.png)
+![image-20201119113825586](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113825586.png)
**点击次数排前50的用户的点击次数都在100次以上。思路:我们可以定义点击次数大于等于100次的用户为活跃用户,这是一种简单的处理思路, 判断用户活跃度,更加全面的是再结合上点击时间,后面我们会基于点击次数和点击时间两个方面来判断用户活跃度。**
@@ -342,7 +342,7 @@ plt.plot(user_click_item_count[:50])
plt.plot(user_click_item_count[25000:50000])
```
-![image-20201119113844946](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113844946.png)
+![image-20201119113844946](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113844946.png)
**可以看出点击次数小于等于两次的用户非常的多,这些用户可以认为是非活跃用户**
@@ -358,14 +358,14 @@ item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'
plt.plot(item_click_count)
```
-![image-20201119113912912](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113912912.png)
+![image-20201119113912912](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113912912.png)
```python
plt.plot(item_click_count[:100])
```
-![image-20201119113930745](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113930745.png)
+![image-20201119113930745](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113930745.png)
**可以看出点击次数最多的前100篇新闻,点击次数大于1000次**
@@ -374,7 +374,7 @@ plt.plot(item_click_count[:100])
plt.plot(item_click_count[:20])
```
-![image-20201119113958254](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113958254.png)
+![image-20201119113958254](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113958254.png)
**点击次数最多的前20篇新闻,点击次数大于2500。思路:可以定义这些新闻为热门新闻, 这个也是简单的处理方式,后面我们也是根据点击次数和时间进行文章热度的一个划分。**
@@ -383,7 +383,7 @@ plt.plot(item_click_count[:20])
plt.plot(item_click_count[3500:])
```
-![image-20201119114017762](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114017762.png)
+![image-20201119114017762](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114017762.png)
**可以发现很多新闻只被点击过一两次。思路:可以定义这些新闻是冷门新闻。**
@@ -397,7 +397,7 @@ union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].ag
union_item[['count']].describe()
```
-![image-20201119114044351](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114044351.png)
+![image-20201119114044351](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114044351.png)
**由统计数据可以看出,平均共现次数2.88,最高为1687。**
@@ -411,14 +411,14 @@ y = union_item['count']
plt.scatter(x, y)
```
-![image-20201119114106223](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114106223.png)
+![image-20201119114106223](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114106223.png)
```python
plt.plot(union_item['count'].values[40000:])
```
-![image-20201119114122557](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114122557.png)
+![image-20201119114122557](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114122557.png)
**大概有70000个pair至少共现一次。**
@@ -432,7 +432,7 @@ plt.plot(union_item['count'].values[40000:])
plt.plot(user_click_merge['category_id'].value_counts().values)
```
-![image-20201119114144058](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114144058.png)
+![image-20201119114144058](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114144058.png)
```python
@@ -440,7 +440,7 @@ plt.plot(user_click_merge['category_id'].value_counts().values)
plt.plot(user_click_merge['category_id'].value_counts().values[150:])
```
-![image-20201119114201764](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114201764.png)
+![image-20201119114201764](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114201764.png)
```python
@@ -455,7 +455,7 @@ user_click_merge['words_count'].describe()
plt.plot(user_click_merge['words_count'].values)
```
-![image-20201119114241194](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114241194.png)
+![image-20201119114241194](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114241194.png)
@@ -469,7 +469,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), re
```
-![image-20201119114300286](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114300286.png)
+![image-20201119114300286](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114300286.png)
**从上图中可以看出有一小部分用户阅读类型是极其广泛的,大部分人都处在20个新闻类型以下。**
@@ -478,7 +478,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), re
user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()
```
-![image-20201119114318523](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114318523.png)
+![image-20201119114318523](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114318523.png)
### 用户查看文章的长度的分布
@@ -490,7 +490,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
```
-![image-20201119114337448](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114337448.png)
+![image-20201119114337448](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114337448.png)
@@ -504,7 +504,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])
```
-![image-20201119114355195](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114355195.png)
+![image-20201119114355195](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114355195.png)
**可以发现大多数人都是看250字以下的文章**
@@ -514,7 +514,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()
```
-![image-20201119114418911](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114418911.png)
+![image-20201119114418911](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114418911.png)
@@ -536,7 +536,7 @@ user_click_merge = user_click_merge.sort_values('click_timestamp')
user_click_merge.head()
```
-![image-20201119114447904](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114447904.png)
+![image-20201119114447904](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114447904.png)
```python
@@ -558,7 +558,7 @@ mean_diff_click_time = user_click_merge.groupby('user_id')['click_timestamp', 'c
plt.plot(sorted(mean_diff_click_time.values, reverse=True))
```
-![image-20201119114505086](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114505086.png)
+![image-20201119114505086](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114505086.png)
**The plot above shows that the time gap between consecutive clicks varies from user to user.**
@@ -573,7 +573,7 @@ mean_diff_created_time = user_click_merge.groupby('user_id')['click_timestamp',
plt.plot(sorted(mean_diff_created_time.values, reverse=True))
```
-![image-20201119122227666](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122227666.png)
+![image-20201119122227666](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122227666.png)
**从图中可以发现用户先后点击文章,文章的创建时间也是有差异的**
@@ -602,7 +602,7 @@ sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]
sub_user_info.head()
```
-![image-20201119122251274](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122251274.png)
+![image-20201119122251274](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122251274.png)
```python
@@ -625,7 +625,7 @@ for _, user_df in sub_user_info.groupby('user_id'):
```
-![image-20201119122310969](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122310969.png)
+![image-20201119122310969](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122310969.png)
@@ -654,5 +654,5 @@ for _, user_df in sub_user_info.groupby('user_id'):
**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git a/docs/ch03/ch3.1/markdown/ch3.1.3.md b/docs/ch03/ch3.1/markdown/ch3.1.3.md
index 323cf46fe..9bf554093 100644
--- a/docs/ch03/ch3.1/markdown/ch3.1.3.md
+++ b/docs/ch03/ch3.1/markdown/ch3.1.3.md
@@ -2,7 +2,7 @@
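The so-called "multi-channel recall" strategy uses different strategies, features, or simple models to each recall a portion of the candidate set, and then mixes these candidates together for the downstream ranking model. It is clearly a trade-off between computation speed and recall rate: each simple strategy keeps recall fast, while designing strategies from different angles keeps the overall recall rate close to ideal so that ranking quality is not hurt. The figure below illustrates multi-channel recall. Because the strategies are independent of one another, they can usually be run concurrently in multiple threads for better efficiency.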
-
+
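The figure above is only one example of multi-channel recall: several different strategies can be used to obtain the candidate items for ranking, and which strategies to use depends heavily on the business. Each task has recall rules that reflect its real-world scenario. For example, in video recommendation the recall rules could be "popular videos", "recall by director", "recall by actor", "recently released", "trending", "recall by genre", and so on.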
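The merging step described above can be sketched as follows. This is only an illustration under stated assumptions: each channel is assumed to return `(item_id, score)` pairs, and the function name `merge_recall_channels`, the channel names, and the weights are hypothetical rather than taken from the project code.

```python
from collections import defaultdict


def merge_recall_channels(channel_results, channel_weights=None, topk=3):
    """Merge per-channel candidates into one ranked list (hypothetical helper).

    channel_results: {channel_name: [(item_id, score), ...]}
    channel_weights: optional {channel_name: weight}; missing channels default to 1.0
    """
    channel_weights = channel_weights or {}
    merged = defaultdict(float)
    for name, items in channel_results.items():
        if not items:
            continue
        weight = channel_weights.get(name, 1.0)
        scores = [score for _, score in items]
        lo, hi = min(scores), max(scores)
        for item_id, score in items:
            # Min-max normalise inside each channel so scores are comparable.
            norm = (score - lo) / (hi - lo) if hi > lo else 1.0
            merged[item_id] += weight * norm
    # Sort by merged score (descending), breaking ties by item id.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:topk]


# Toy data: two channels with scores on very different scales.
example = {
    "itemcf": [(101, 9.0), (102, 5.0), (103, 1.0)],
    "hot": [(102, 30.0), (104, 10.0)],
}
merged = merge_recall_channels(example, {"itemcf": 1.0, "hot": 0.5}, topk=3)
```

Normalising inside each channel before the weighted sum is one possible design choice; it keeps a channel with large raw scores from dominating the merged list.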
@@ -1344,4 +1344,4 @@ final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, we
**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git a/docs/ch03/ch3.1/markdown/ch3.1.4.md b/docs/ch03/ch3.1/markdown/ch3.1.4.md
index 197765e8b..e5e267f0e 100644
--- a/docs/ch03/ch3.1/markdown/ch3.1.4.md
+++ b/docs/ch03/ch3.1/markdown/ch3.1.4.md
@@ -193,7 +193,7 @@ Word2Vec主要思想是:一个词的上下文可以很好的表达出词的语
- skip-gram:已知中心词预测周围词。
- cbow:已知周围词预测中心词。
-![image-20201106225233086](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)
+![image-20201106225233086](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)
在使用gensim训练word2vec的时候,有几个比较重要的参数
- size: 表示词向量的维度。
@@ -985,5 +985,5 @@ tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=Fa
**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git a/docs/ch03/ch3.1/markdown/ch3.1.5.md b/docs/ch03/ch3.1/markdown/ch3.1.5.md
index 9fef3fda5..0e8f45abe 100644
--- a/docs/ch03/ch3.1/markdown/ch3.1.5.md
+++ b/docs/ch03/ch3.1/markdown/ch3.1.5.md
@@ -407,7 +407,7 @@ tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_be
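Next we try the DIN model. DIN stands for Deep Interest Network, a model Alibaba proposed in 2018 because earlier deep learning models could not express users' diverse interests. It computes a representation vector of user interest from the relevance between the given candidate ad and the user's historical behaviors. Specifically, it introduces a local activation unit that soft-searches the relevant parts of the behavior history, and uses a weighted sum to obtain the user-interest representation with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the representation. Because this vector differs from ad to ad, the model's expressive power improves considerably. The model therefore also suits this news recommendation task: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user clicked in the past. The model structure is as follows: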
-![image-20201116201646983](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)
+![image-20201116201646983](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)
我们这里直接调包来使用这个模型, 关于这个模型的详细细节部分我们会在下一期的推荐系统组队学习中给出。下面说一下该模型如何具体使用:deepctr的函数原型如下:
@@ -949,4 +949,4 @@ submit(rank_results, topk=5, model_name='ensumble_staking')
关于Datawhale: Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
\ No newline at end of file
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
\ No newline at end of file
diff --git a/docs/ch03/ch3.2/3.2.1.3.md b/docs/ch03/ch3.2/3.2.1.3.md
index 2d79c1fbf..9153d9f15 100644
--- a/docs/ch03/ch3.2/3.2.1.3.md
+++ b/docs/ch03/ch3.2/3.2.1.3.md
@@ -20,7 +20,7 @@ sudo apt-get install redis-server
下载完成的结果
-![image-20211030164414594](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164414594.png)
+![image-20211030164414594](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164414594.png)
**启动Redis服务:**
@@ -30,7 +30,7 @@ sudo apt-get install redis-server
service redis-server status
```
-![image-20211030164432589](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164432589.png)
+![image-20211030164432589](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164432589.png)
检查当前进程,查看redis是否启动。(ps: 可以看到redis服务正在监听6379端口)
@@ -38,7 +38,7 @@ service redis-server status
ps -aux|grep redis-server
```
-![image-20211030164448713](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164448713.png)
+![image-20211030164448713](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164448713.png)
或者进入redis客户端,与服务器进行通信,当输入ping命令,如果返回 PONG 表示Redis已成功安装。
@@ -46,7 +46,7 @@ ps -aux|grep redis-server
redis-cli
```
-![image-20211030164455928](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164455928.png)
+![image-20211030164455928](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164455928.png)
上面的127.0.0.1 是redis服务器的 IP 地址,6379 是 Redis 服务器运行的端口。
diff --git a/docs/ch03/ch3.2/3.2.1.4.md b/docs/ch03/ch3.2/3.2.1.4.md
index 8a74c546e..dc29a96f1 100644
--- a/docs/ch03/ch3.2/3.2.1.4.md
+++ b/docs/ch03/ch3.2/3.2.1.4.md
@@ -129,7 +129,7 @@ class QuotesSpider(scrapy.Spider):
因为新闻爬取项目和新闻推荐系统是放在一起的,为了方便提前学习,下面直接给出项目的目录结构以及重要文件中的代码实现,最终的项目将会和新闻推荐系统一起开源出来
-
+
1. **创建一个scrapy项目:**
@@ -164,7 +164,7 @@ class SinanewsItem(scrapy.Item):
Note that the crawler targets a relatively simple listing page rather than the latest Sina News main site, which makes crawling much easier. The listing page URL looks like: https://news.sina.com.cn/roll/#pageid=153&lid=2509&k=&num=50&page=1
-
+
```python
# -*- coding: utf-8 -*-
@@ -497,7 +497,7 @@ sh run_scrapy_sina.sh
最终查看数据库中的数据:
-
+
### 参考资料
diff --git a/docs/ch03/ch3.2/3.2.1.5.md b/docs/ch03/ch3.2/3.2.1.5.md
index 4cc60eda6..bd9d70acc 100644
--- a/docs/ch03/ch3.2/3.2.1.5.md
+++ b/docs/ch03/ch3.2/3.2.1.5.md
@@ -1,4 +1,4 @@
-![image-20211203145147649](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203145147649.png)
+![image-20211203145147649](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203145147649.png)
# 自动化构建用户及物料画像
@@ -19,13 +19,13 @@
首先说一下新物料添加到物料库的逻辑是什么,新物料添加到物料库这件事情肯定是发生在新闻爬取之后的,然后要将新物料添加到物料库还需要对新物料做一些简单的画像处理,目前我们定义的画像字段如下(处理后的画像存储在Mongodb):
-
+
具体的逻辑就是遍历今天爬取的所有文章,然后通过文章的title来判断这篇文章是否已经在物料库中(新闻网站有可能有些相同的文章会出现在多天)来去重。然后再根据我们定义的一些字段,给画像相应的字段初始化,最后就是存入画像物料池中。
关于旧物料画像的更新,这里就需要先了解一下旧物料哪些字段会被用户的行为更新。下面是新闻列表展示页,我们会发现前端会展示新闻的阅读、喜欢及收藏次数。而用户的交互(阅读、点赞和收藏)会改变这些值。
-
+
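To show this dynamic behavior information in the front end in real time, we store each news item's dynamic information in redis ahead of time, and the online service reads news data directly from redis. When a user interacts with a news item, the dynamic information is updated directly in redis as well, so the front end can always fetch the latest dynamic portrait of the news in real time.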
@@ -175,9 +175,9 @@ if __name__ == "__main__":
上面的内容说完了物料的更新,接下来介绍一下对于更新完的物料是如何添加到redis数据库中去的。关于新闻内容在redis中的存储,我们将新闻的信息拆成了两部分,一部分是新闻不会发生变化的属性(例如,创建时间、标题、新闻内容等),还有一部分是物料的动态属性,在redis中存储的key的标识分别为:static_news_detail:news_id和dynamic_news_detail:news_id 下面是redis中存储的真实内容
-
+
-
+
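The purpose is to make real-time updates of the dynamic information more efficient online. To fetch the full details of a news item, both parts must be looked up and joined before the result is sent to the front end for display. The code logic for this part is as follows: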
@@ -306,11 +306,11 @@ if __name__ == "__main__":
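Because our system keeps every registered user (new and old) in a single table, each portrait update only needs one pass over all users in the registration table. Before describing the portrait-building logic, we first need to know which fields a user portrait contains. The following was queried directly from mongo: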
-
+
从上面可以看出,主要是用户的基本信息和用户历史信息相关的一些标签,对于用户的基本属性特征这个可以直接从注册表中获取,那么对于跟用户历史阅读相关的信息,需要统计用户历史的所有阅读、喜欢和收藏的新闻详细信息。为了得到跟用户历史兴趣相关的信息,我们需要对用户的历史阅读、喜欢和收藏这几个历史记录给存起来,其实这些信息都可以从日志信息中获取得到,但是这里有个工程上的事情得先说明一下,先看下面这个图,对于每个用户点进一篇新闻的详情页
-
+
最底部有个喜欢和收藏,这个前端展示的结果是从后端获取的数据,那就意味着后端需要维护一个用户历史点击及收藏过的文章列表,这里我们使用了mysql来存储,主要是怕redis不够用。其实这两个表不仅仅可以用来前端展示用的,还可以用来分析用户的画像,这都给我们整理好了用户历史喜欢和收藏了。
@@ -622,7 +622,7 @@ echo " "
**crontab定时任务:**
-![image-20211203172613512](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203172613512.png)
+![image-20211203172613512](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203172613512.png)
将定时任务拆解一下:
diff --git a/docs/ch03/ch3.2/3.2.2.3.md b/docs/ch03/ch3.2/3.2.2.3.md
index e251e6515..e736c0fa1 100644
--- a/docs/ch03/ch3.2/3.2.2.3.md
+++ b/docs/ch03/ch3.2/3.2.2.3.md
@@ -6,7 +6,7 @@
下面主要展现的是项目的整体部分,主要分为推荐页,热门页以及新闻详情页。
-
+
diff --git a/docs/ch03/ch3.2/3.2.3.md b/docs/ch03/ch3.2/3.2.3.md
index 3beac83f1..b2cce14f1 100644
--- a/docs/ch03/ch3.2/3.2.3.md
+++ b/docs/ch03/ch3.2/3.2.3.md
@@ -1,6 +1,6 @@
-![](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片Untitled.png)
+![](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片Untitled.png)
本篇文章主要是讲解推荐系统流程构建,主要包括Offline和Online两个部分。
diff --git a/docs/ch03/ch3.2/3.2.4.3.md b/docs/ch03/ch3.2/3.2.4.3.md
index 957e4719c..8da93411b 100644
--- a/docs/ch03/ch3.2/3.2.4.3.md
+++ b/docs/ch03/ch3.2/3.2.4.3.md
@@ -12,7 +12,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
### **DSSM 模型结构**
-![image-20220224100424897](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100424897.png)
+![image-20220224100424897](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100424897.png)
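The figure above shows the structure of the DSSM model. The network is fairly simple: it consists of a few DNN layers. The embeddings of the search text (Query) and of the text to be matched (Document) are fed into the network, which outputs 128-dimensional vectors. The cosine similarity between these vectors measures their distance and serves as the similarity score between each query and document, and a softmax is then applied over the scores.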
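The scoring step at the top of DSSM (cosine similarity followed by a softmax) can be sketched in NumPy as below. The random vectors are stand-ins for the tower outputs, not real model outputs, and the smoothing factor that the original paper applies to the similarities before the softmax is omitted for brevity.

```python
import numpy as np


def cosine_similarity(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    return d @ q


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


rng = np.random.default_rng(0)
query = rng.normal(size=128)        # stand-in for the 128-d query vector
docs = rng.normal(size=(4, 128))    # stand-ins for four 128-d document vectors
docs[0] = query                     # make the first document identical to the query

scores = cosine_similarity(query, docs)   # one similarity score per document
probs = softmax(scores)                   # scores turned into a distribution
```

In this toy setup the first document matches the query exactly, so it receives the highest similarity and the highest probability.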
@@ -28,7 +28,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
该模型主要是将上述模型中的两个“塔”改为独立的 user 和 item 两个子网络,大概结构如下:
-![img](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-f7ecbf1faf7899c6e2999182055470fb_720w.jpg)
+![img](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-f7ecbf1faf7899c6e2999182055470fb_720w.jpg)
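The structure is very simple. As the figure above shows, the user tower is on the left and the Item tower on the right. The user side takes user features as input (user profile information, statistical attributes, historical behavior sequences, and so on); the Item side takes Item features as input (basic Item information, attribute information, and so on). Each tower is itself a classic DNN: during training the input goes from one-hot features to feature embeddings and then through several hidden DNN layers. The two towers output a user embedding and an item embedding respectively, and the inner product or cosine similarity of the two is computed, so that user and item are mapped into a shared semantic space of the same dimension.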
@@ -38,7 +38,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
该模型主要的改进是在user塔和Item塔的特征Embedding层上,各自加入一个SENet模块,借助SENet网络用来动态地学习特征的重要性,根据得到的特征权重与对应特征的embedding相乘,进而达到放大重要特征或抑制无效特征的目的,模型大致结构如下所示:
-![img](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-8766fee1b442ed17111d5822033f960f_720w.jpg)
+![img](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-8766fee1b442ed17111d5822033f960f_720w.jpg)
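The difference from the plain DSSM model is the additional SENet network. It passes the feature embeddings through a Squeeze stage and an Excitation stage to obtain a weight vector, multiplies each feature's embedding by its corresponding weight, and feeds the reweighted features, with the important ones emphasized, into the plain DSSM network. As for why SENet helps, Zhang Junlin's explanation is that SENet highlights the features that matter most for feature crossing between the high-level User embedding and Item embedding, which better expresses interaction between the two sides and avoids the noise that ineffective single-side features introduce when fused non-linearly through the two DNN towers, while also adding non-linearity. For details on SENet, see the [original paper](https://arxiv.org/abs/1709.01507).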
@@ -48,7 +48,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
该模型是Youtube于2019年在RecSys发表的一篇工作,这个模型从结构上来看是最普通的双塔。左边是user塔,输入包括两部分,第一部分是user当前正在观看的视频的特征,第二部分user的特征是用户历史行为的统计量,例如用户最近观看的N条视频的id embedding均值,这两部分融合起来一起输入user侧的输入。右边是item塔,将候选视频的特征作为输入,计算item的 embedding。之后也是再计算两侧embedding的相似度,进行学习。 模型的大致结构如下所示:
-![image-20220224100307472](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100307472.png)
+![image-20220224100307472](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100307472.png)
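For this model the key point is not a structural change but the negative sampling problem. Recall can be viewed as a multi-class classification problem: the output layer applies a softmax and the cross-entropy loss is computed on it. When the number of candidate items is very large, a softmax over all items is infeasible, so the usual practice is to randomly sample a batch of items from the full set and apply the softmax over that batch. Using in-batch samples as each other's negatives, however, introduces a large bias: popular items are more likely to be used as negatives. The contribution of this model is a way to reduce the bias caused by in-batch negative sampling. For details, see the [original paper](https://dl.acm.org/doi/10.1145/3298689.3346996).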
diff --git a/docs/ch03/ch3.2/3.2.8.1.md b/docs/ch03/ch3.2/3.2.8.1.md
index 15c96e8c0..58553a9a9 100644
--- a/docs/ch03/ch3.2/3.2.8.1.md
+++ b/docs/ch03/ch3.2/3.2.8.1.md
@@ -45,30 +45,30 @@
- Q: When running the `Scrapy` news-crawling exercise, nothing gets written to the `mongodb` database.
-
+
-
+
答:`mongodb`安装是否成功?有没有报错之类的。
问:成功安装。爬虫已经成功,我看`title content`已经有数据了
-
+
答:你这里是不是什么都没有,你退出`mongo`命令行重新进入查看一下呢?
-
+
问:对,我是在`windows`下做的,还是没有
-
+
答:你看下这个路径是不是有问题,我这里好像忘记改成`fun-rec`的路径了,你改成`fun-rec`下的路径再试试,有可能这里没有的参数没有导入进去。
@@ -86,13 +86,13 @@
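A: That should not matter, though. Did you write the code yourself, or are you running the code under `fun-rec`? Check the pipeline file to see whether the parameter configuration is wrong, add a few print statements to inspect it, and then call the `insert` method on its own there to insert something and see whether that works.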
-
+
Q (resolved): Found the problem. When I copied the `pipelines` file, the `def` blocks were not aligned (an indentation error).
-
+
- 问:`linux`一般软件安装都放在哪个目录下面啊?是`usr/local`吗?
@@ -125,16 +125,16 @@
- 问:服务没启动问题
-
+
答:对,需要安装,启动这个服务,已经加入到文档中。
-
+
-
+
- 问:`redis key`的问题如何处理?
@@ -278,7 +278,7 @@
- 问:运行后端`server`遇到过这个报错吗?
-
+
答:重新安装下`cryptography`这个包
diff --git a/docs/ch03/ch3.2/3.2.8.2.md b/docs/ch03/ch3.2/3.2.8.2.md
index 6ea8ce49b..940d6fec3 100644
--- a/docs/ch03/ch3.2/3.2.8.2.md
+++ b/docs/ch03/ch3.2/3.2.8.2.md
@@ -26,7 +26,7 @@
- 问:请问这个报错是缺少什么?
-
+
答:需要下载`drive`驱动才可以正常运行。
@@ -42,7 +42,7 @@
问:应该是有
-
+
@@ -67,7 +67,7 @@
- 问:`python process material.py`需要`redis`验证怎么解决,有没有除了取消密码之外的解决方式。
-
+
答:估计是设置了`redis`的用户和密码,这个没有办法,只能取消密码。或者修改代码,连接`redis`
@@ -98,7 +98,7 @@
答:修改此处代码。
-
+
diff --git a/docs/ch03/ch3.2/3.2.8.3.md b/docs/ch03/ch3.2/3.2.8.3.md
index 8d82a64ff..d05688ff3 100644
--- a/docs/ch03/ch3.2/3.2.8.3.md
+++ b/docs/ch03/ch3.2/3.2.8.3.md
@@ -18,7 +18,7 @@
- 问:请问这样处理会不会时间复杂度较大?
-
+
A: That is not easy. How would you use an `id` to detect duplicate crawled articles? A unique `id` would inevitably be tied to time.
@@ -32,7 +32,7 @@
- 问:请教下大家,正常这两个`col`的大小是不是一样的?
-
+
答:不是一样大,你看一下具体内容就知道了,
@@ -50,7 +50,7 @@
答:这一步
-
+
@@ -65,7 +65,7 @@
问:`update_redis_mongo_protrail_data`这个函数是遍历`material_collection`,也就是`mongo_server.get_feature_protrail_collection()`也就是`featureprotrail`应该是和`featureprotrail`一样多的。
-
+
答:理解一样多没有问题,后面会修改。
@@ -74,7 +74,7 @@
- 问:用户的喜欢,收藏,点击是直接落到`mysql`里面吗?
-
+
答:是的,前端点击阅读、喜欢、收藏会实时更新。
@@ -83,7 +83,7 @@
- 问:这个关键词属于长尾是什么意思?
-
+
答:个别关键词的类别占了大量数目,以至于前三一直是那几个,长尾现象。
@@ -93,7 +93,7 @@
- 问:请教下大家,这个`user_exposure.py`是用来建`exposure_日期`这个表的么
-
+
答:是的。
\ No newline at end of file
diff --git a/docs/ch03/ch3.2/3.2.md b/docs/ch03/ch3.2/3.2.md
index bb37dc334..e73b904fd 100644
--- a/docs/ch03/ch3.2/3.2.md
+++ b/docs/ch03/ch3.2/3.2.md
@@ -11,7 +11,7 @@
**新闻推荐系统实践前端展示和后端逻辑(项目没有任何商用价值仅供入门者学习)**
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.2 \351\200\273\350\276\221\345\233\236\345\275\222.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.2 \351\200\273\350\276\221\345\233\236\345\275\222.md"
index b3abf16d1..af3404357 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.2 \351\200\273\350\276\221\345\233\236\345\275\222.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.2 \351\200\273\350\276\221\345\233\236\345\275\222.md"
@@ -8,7 +8,7 @@ f(x)=\frac{1}{1+e^{-x}}
$$
**sigmoid函数图像:**
-![Sigmoid_function](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片Sigmoid_function.png)
+![Sigmoid_function](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片Sigmoid_function.png)
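The sigmoid function has domain $(-∞,+∞)$ and range $(0, 1)$. Logistic regression uses the sigmoid link function to map a variable into $(0, 1)$, which is why the most basic LR classifier is suited to binary classification (class 0 vs. class 1).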
@@ -324,7 +324,7 @@ $$
关于该模型的详细原理和实现,可以参考资料[4]。
-![Sigmoid_function_01](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片gbdt_lr.png)
+![Sigmoid_function_01](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片gbdt_lr.png)
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.3 \347\245\236\347\273\217\347\275\221\347\273\234.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.3 \347\245\236\347\273\217\347\275\221\347\273\234.md"
index 0f2f9c426..e037e909c 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.3 \347\245\236\347\273\217\347\275\221\347\273\234.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.3 \347\245\236\347\273\217\347\275\221\347\273\234.md"
@@ -12,7 +12,7 @@
+ 轴突(Axon)可以把自身的兴奋状态从胞体传送到另一个神经元或其他组织,每个神经元只有一个轴突;
-
+
神经元可以接收其他神经元的信息,也可以发送信息给其他神经元。神经元之间没有物理连接,两个“连接”的神经元之间留有 20 纳米左右的缝隙,并靠突触进行互联来传递信息,形成一个神经网络,即神经系统。
@@ -30,7 +30,7 @@
1943 年,心理学家 McCulloch 和数学家 Pitts 根据生物神经元的结构,提出了一种非常简单的神经元模型,MP神经元。现代神经网络中的神经元和 MP 神经元的结构并无太多变化。不同的是,MP 神经元中的激活函数 $f$ 为 $0$ 或 $1$ 的阶跃函数,而现代神经元中的激活函数通常要求是连续可导的函数。
-
+
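Suppose a neuron receives $n$ inputs $x_1, ... ,x_n$. Let the vector $x=[x_1;x_2;...;x_n]$ denote these inputs, and let the net input $z \in \mathbb{R}$ denote the weighted sum of the input signals $x$ received by the neuron: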
@@ -54,7 +54,7 @@ $$
理想中的激活函数为阶跃函数,它可以将输入值映射为 $0$ 或 $1$ ,这里 $1$ 对应神经元兴奋, $0$ 对应神经元抑制。但是阶跃函数具有不连续,不光滑等不太好的性质。
-![img](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片threshold.png)
+![img](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片threshold.png)
为了增强网络的表示能力和学习能力,激活函数需要具备以下几点性质:
@@ -72,7 +72,7 @@ $$
常用的 Sigmoid 函数有 Logistic 函数和 Tanh 函数,它们的形状都呈 S 型,均为两端饱和函数。所谓两端饱和,指的是当变量 $x$ 趋无穷时,函数 $f(x)$ 的导数 $f'(x)$ 趋向于 $0$ 。
-
+
+ Logistic 函数定义:
$$
@@ -155,7 +155,7 @@ $$
$$
其中, $\boldsymbol{W}^{(l)} \in \mathbb{R}^{N_{l-1} \times N_{l}}$ 为第 $l$ 层的参数矩阵,$\boldsymbol{b}^{(l)}\in \mathbb{R}^{N_{l}}$ 为第 $l$ 层的偏置向量,$f_l$ 为第 $l$ 层的激活函数,$\boldsymbol{a}^{(l)}\in \mathbb{R}^{N_{l}}$为第 $l$ 层的输出。示例如下:
-
+
这样,前馈神经网络可以通过逐层进行信息传递,得到网络最后的输出 $a^{L}$:
$$
@@ -172,7 +172,7 @@ $$
假设存在 $y=f(x)$ 的计算,则该计算的反向传播如下:
-![image-20211210204402301](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211210204402301.png)
+![image-20211210204402301](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211210204402301.png)
Backpropagation proceeds by multiplying the signal $E$ by the node's local derivative $\frac{\partial y}{\partial x}$ and passing the result on to the next node.
@@ -197,9 +197,9 @@ $$
$$
将链式法则用计算图表示,如下:
-
+
-
+
### 2.4.3 反向传播
@@ -231,13 +231,13 @@ y=\frac{1}{1+exp(-x)}
$$
用计算图可以表示为:
-
+
可以看到,复杂的公式经过拆解,已经没那么复杂了。现在,按照前面总结的反向传播规则,可以得到加上反向传播后的计算图:
-
+
+ 以最后一个运算 $/$ (除法)为例,令输入 $t=1+exp(-x)$,则输出 $y=\frac{1}{t}$,有:
$$
@@ -262,7 +262,7 @@ $$
对于一些层数较深的神经网络模型,在训练时可能会出现一些问题,其中就包括梯度消失问题(gradient vanishing problem)和梯度爆炸问题(gradient exploding problem)。梯度消失问题和梯度爆炸问题一般随着网络层数的增加会变得越来越明显。
-![preview](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-82873a89ff3c14c1d3b42d1862917f35_r.jpg)
+![preview](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-82873a89ff3c14c1d3b42d1862917f35_r.jpg)
我们知道前馈神经网络的传播公式如下:
$$
@@ -304,7 +304,7 @@ $$
根据Sigmoid 函数的表达式,其函数图像(左)及其导数图像(右)如下:
-
+
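As can be seen, the derivative of the Sigmoid function takes values in $(0, 0.25]$, and weight matrices are usually initialized with $|| \boldsymbol{W}||<1$, so $|| f'(\cdot) \times \boldsymbol{W}||\le 0.25 $. By the chain rule, the repeated multiplication makes the gradient $\large \frac{\partial \boldsymbol{L}}{\partial \boldsymbol{W}^{(0)}}$ smaller and smaller, which causes the vanishing gradient problem.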
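The bound above can be checked numerically. The sketch below evaluates the Sigmoid derivative on a grid and shows how fast the upper bound $0.25^n$ shrinks with depth; the grid and the depths 1, 5, and 10 are arbitrary illustrative choices, and the weight matrices are left out so only the activation's contribution is shown.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


xs = np.linspace(-10.0, 10.0, 2001)
max_grad = sigmoid_grad(xs).max()    # the maximum is attained at x = 0

# Upper bound on the product of sigmoid derivatives across n layers.
bounds = {n: 0.25 ** n for n in (1, 5, 10)}
```

With ten Sigmoid layers the activation derivatives alone can scale the gradient by at most `0.25 ** 10`, which is below one in a million.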
@@ -326,7 +326,7 @@ Youtube 作为全球最大的 UGC 的视频网站,需要在百万量级的视
-
+
从上图可以看出,YoutubeDNN 包含了两个阶段分别为:
@@ -341,7 +341,7 @@ Youtube 作为全球最大的 UGC 的视频网站,需要在百万量级的视
-
+
从模型的结构来看,召回阶段使用的模型并不复杂,为包含多层神经网络的 DNN 模型。下面简单分析模型的流程:
@@ -375,7 +375,7 @@ Youtube 作为全球最大的 UGC 的视频网站,需要在百万量级的视
-
+
可以看出,该阶段使用的模型与前面召回阶段相同,均为 DNN 模型。不同的是:
@@ -406,7 +406,7 @@ Wide&Deep是谷歌发表在 DLRS 2016 上的文章《Wide & Deep Learning for
Wide & Deep 已成功应用到了 Google Play 的 app 推荐业务,具体的模型结构如下:
-![image-20211214203653752](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211214203653752.png)
+![image-20211214203653752](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211214203653752.png)
从结构图上看,Wide&Deep 由两部分组成,分别为 Wide 部分和 Deep 部分。简单来说,Wide 部分就是一个线性层,Deep 部分为多层前馈神经网络层,下面先对原理进行介绍:
@@ -465,7 +465,7 @@ $$
下图,是谷歌在应用商店中的推荐模型架构:
-
+
+ Deep 部分的输入是全量的特征向量,包括用户年龄(Age)、已安装应用数量(#App Installs)、设备类型(Device Class)、已安装应用(User Installed App)、曝光应用( Impression App)等特征。已安装应用、曝光应用等类别型特征,需要经过Embedding层输入连接层(Concatenated Embedding),拼接成1200维的Embedding向量,再依次经过3层ReLU全连接层,最终输入LogLoss输出层。
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.4 \345\270\270\347\224\250\344\274\230\345\214\226\347\256\227\346\263\225.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.4 \345\270\270\347\224\250\344\274\230\345\214\226\347\256\227\346\263\225.md"
index 3abc29527..72199db46 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.4 \345\270\270\347\224\250\344\274\230\345\214\226\347\256\227\346\263\225.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.4 \345\270\270\347\224\250\344\274\230\345\214\226\347\256\227\346\263\225.md"
@@ -12,7 +12,7 @@
$$
\theta x+(1-\theta) y \in S
$$
-![convex_set](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片convex_set.png)
+![convex_set](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片convex_set.png)
### 1.1.2 凸函数
@@ -22,7 +22,7 @@ f(\lambda x+(1-\lambda)y) \le \lambda f(x)+(1-\lambda)f(y)
$$
Intuitively, for $z \in (x, y)$, the point $\left (z,f(z)\right)$ always lies on or below the line segment connecting $(x, f(x))$ and $(y,f(y))$.
-![convex_func](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片convex_func.png)
+![convex_func](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片convex_func.png)
Convex functions have an important property: **every local minimum is a global minimum**.
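The defining inequality $f(\lambda x+(1-\lambda)y) \le \lambda f(x)+(1-\lambda)f(y)$ can be spot-checked on sample points. The helper `satisfies_convexity` below is a hypothetical illustration: passing the check on a few points does not prove convexity, but a single violation is enough to show that a function is not convex.

```python
import numpy as np


def satisfies_convexity(f, xs, ys, lambdas, tol=1e-9):
    """Check f(l*x + (1-l)*y) <= l*f(x) + (1-l)*f(y) on the sampled points."""
    for x, y in zip(xs, ys):
        for lam in lambdas:
            lhs = f(lam * x + (1 - lam) * y)
            rhs = lam * f(x) + (1 - lam) * f(y)
            if lhs > rhs + tol:
                return False
    return True


# A few hand-picked point pairs; (0, pi) exposes the non-convexity of sine.
xs = np.array([-3.0, 0.0, 1.0])
ys = np.array([2.0, np.pi, 4.0])
lambdas = np.linspace(0.0, 1.0, 11)

square_ok = satisfies_convexity(lambda t: t ** 2, xs, ys, lambdas)   # convex
sine_ok = satisfies_convexity(np.sin, xs, ys, lambdas)               # not convex
```

For the sine function the midpoint of 0 and π gives `sin(π/2) = 1`, which lies above the chord between the two endpoints, so the check fails there.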
@@ -169,7 +169,7 @@ $$
$$
一旦达到收敛条件的话,迭代就结束。从梯度下降法的迭代公式来看,下一个点的选择与当前点的位置和它的梯度相关。
-![optimization](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片optimization.gif)
+![optimization](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片optimization.gif)
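Different optimization algorithms start from different ideas when optimizing the objective function, so the trajectories they follow while searching for a local minimum also differ.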
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.5 \346\267\261\345\272\246\345\255\246\344\271\240\346\250\241\345\236\213\346\220\255\345\273\272\345\237\272\347\241\200.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.5 \346\267\261\345\272\246\345\255\246\344\271\240\346\250\241\345\236\213\346\220\255\345\273\272\345\237\272\347\241\200.md"
index b3b5025c7..453b5aee1 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.5 \346\267\261\345\272\246\345\255\246\344\271\240\346\250\241\345\236\213\346\220\255\345\273\272\345\237\272\347\241\200.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.5 \346\267\261\345\272\246\345\255\246\344\271\240\346\250\241\345\236\213\346\220\255\345\273\272\345\237\272\347\241\200.md"
@@ -122,7 +122,7 @@ model = keras.Model(
keras.utils.plot_model(model, "multi_input_and_output_model.png", show_shapes=True)
```
-![image-20210226174510287](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210226174510287.png)
+![image-20210226174510287](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210226174510287.png)
从上面这个图就可以看出,模型多输入,多输出,共享层的结构,并且也会发现搭建的过程也是非常的简单。
@@ -134,7 +134,7 @@ keras.utils.plot_model(model, "multi_input_and_output_model.png", show_shapes=Tr
先说答案:将输入的数据转换成字典的形式,定义输入层的时候让输入层的name和字典中特征的key一致,就可以使得输入的数据和对应的Input层对应,后面搭建模型就是和上面介绍的一样的了。
-
+
直接看个例子吧:
@@ -176,7 +176,7 @@ model.fit(x, y, batch_size=1, epochs=2, validation_split=0.2)
keras.utils.plot_model(model, "multi_input_and_output_model.png", show_shapes=True)
```
-
+
上面就是举了个简单的例子说明,当多输入特别多的时候,构建模型我们可以将数据转换成字典的形式,然后字典中特征的名称与其对应的Input层的名称一致就行,这里是为了后面搭建复杂模型打基础。
@@ -184,7 +184,7 @@ keras.utils.plot_model(model, "multi_input_and_output_model.png", show_shapes=Tr
相信大家对DeepCTR开源项目应该是有点了解,DeepCTR通过对现有的基于深度学习的点击率预测模型的结构进行抽象总结,在设计过程中采用模块化的思路,各个模块自身具有高复用性,各个模块之间互相独立。 基于深度学习的点击率预测模型按模型内部组件的功能可以划分成以下4个模块:输入模块,嵌入模块,特征提取模块,预测输出模块。关于DeepCTR的介绍可以参考这个文章[DeepCTR:易用可扩展的深度学习点击率预测算法包](https://zhuanlan.zhihu.com/p/53231955)
-
+
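This open-source project is very well made. Its level of polish makes the source hard for beginners to read directly, yet its design is well worth studying for newcomers to recommender systems. For this course we therefore borrowed DeepCTR's design ideas and re-implemented the course code with extensive comments, so that learners who understand the functional-API model building described above can quickly follow the design of the source code and the principles of the models. Below are a few points to note about how our code follows the DeepCTR implementation.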
@@ -232,7 +232,7 @@ keras.utils.plot_model(model, "multi_input_and_output_model.png", show_shapes=Tr
上面在说了类别特征和可变长的序列特征,在这两个Input层之后都需要将其转化成Embedding向量或者Embedding矩阵,在keras中转化成Embedding向量和Embedding矩阵只是相差一个参数的问题
-
+
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.6 Word2vec.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.6 Word2vec.md"
index 2ed19ec8b..2883242cf 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.6 Word2vec.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.0 \346\234\272\345\231\250\345\255\246\344\271\240\345\237\272\347\241\200/1.0.6 Word2vec.md"
@@ -56,7 +56,7 @@ one-hot向量的维度是词汇表的大小(如:500,000)
如果我们可以使用某种方法为每个单词构建一个合适的dense vector,如下图,那么通过点积等数学计算就可以获得单词之间的某种联系
-
+
# Word2vec
@@ -71,7 +71,7 @@ one-hot向量的维度是词汇表的大小(如:500,000)
我们先引入上下文context的概念:当单词 w 出现在文本中时,其**上下文context**是出现在w附近的一组单词(在固定大小的窗口内),如下图
-
+
这些上下文单词context words决定了banking的意义
@@ -97,13 +97,13 @@ Word2vec包含两个模型,**Skip-gram与CBOW**。下面,我们先讲**Skip-
下图展示了以“into”为中心词,窗口大小为2的情况下它的上下文词。以及相对应的$P(o|c)$
-
+
我们滑动窗口,再以banking为中心词
-
+
那么,如果我们在整个语料库上不断地滑动窗口,我们可以得到所有位置的$P(o|c)$,我们希望在所有位置上**最大化单词o在单词c周围出现了这一事实**,由极大似然法,可得:
@@ -115,13 +115,13 @@ $$
此式还可以依图3写为:
-
+
加log,加负号,缩放大小可得:
-
+
上式即为**skip-gram的损失函数**,最小化损失函数,就可以得到合适的词向量
@@ -141,7 +141,7 @@ $$
又P(o|c)是一个概率,所以我们在整个语料库上使用**softmax**将点积的值映射到概率,如图6
-
+
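Note: in the figure above the center-word vector is $v_{c}$ and the context-word vector is $u_{o}$. In other words, every word has two vectors: **when word w is the center word, $v_{w}$ is used as its vector, and when it is a context word, $u_{w}$ is used**. This is done to simplify computations such as differentiation. After training, we can use $v_{w}$, $u_{w}$, or the average of the two as the vector for word w. The model structure in the next part shows more clearly where the two vectors sit in the model.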
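A minimal NumPy sketch of $P(o|c)$ as a softmax over the dot products $u_o^{T} v_c$ may help here. The matrices `U` and `V` hold the context-word and center-word vectors; the vocabulary size, the dimension, and the random initial values are toy assumptions chosen only for illustration.

```python
import numpy as np


def prob_context_given_center(center_idx, U, V):
    """P(o | c) for every word o in the vocabulary.

    U: (vocab, dim) context-word vectors u_w
    V: (vocab, dim) center-word vectors v_w
    """
    logits = U @ V[center_idx]        # u_o^T v_c for every word o
    logits = logits - logits.max()    # stabilise the exponentials
    exp = np.exp(logits)
    return exp / exp.sum()


rng = np.random.default_rng(0)
vocab_size, dim = 6, 4                # toy sizes for illustration
U = rng.normal(size=(vocab_size, dim))
V = rng.normal(size=(vocab_size, dim))

p = prob_context_given_center(2, U, V)   # distribution over all context words
```

The normalisation runs over the whole vocabulary, which is the cost that motivates approximations such as negative sampling.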
@@ -153,7 +153,7 @@ $$
## Word2vec模型结构
-
+
如图八所示,这是一个输入为1 X V维的one-hot向量(V为整个词汇表的长度,这个向量只有一个1值,其余为0值表示一个词),单隐藏层(**隐藏层的维度为N,这里是一个超参数,这个参数由我们定义,也就是词向量的维度**),输出为1 X V维的softmax层的模型。
@@ -175,13 +175,13 @@ $W^{I}$为V X N的参数矩阵,$W^{O}$为N X V的参数矩阵。
如上文所述,Skip-gram为给定中心词,预测周围的词,即求P(o|c),如下图所示:
-
+
而CBOW为给定周围的词,预测中心词,即求P(c|o),如下图所示:
-
+
@@ -194,7 +194,7 @@ $W^{I}$为V X N的参数矩阵,$W^{O}$为N X V的参数矩阵。
我们再看一眼,通过softmax得到的$P(o|c)$,如图:
-
+
@@ -209,7 +209,7 @@ $W^{I}$为V X N的参数矩阵,$W^{O}$为N X V的参数矩阵。
我们首先给出负采样的损失函数:
-
+
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.1 \346\246\202\350\277\260.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.1 \346\246\202\350\277\260.md"
index 3c75e7052..45bbe183e 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.1 \346\246\202\350\277\260.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.1 \346\246\202\350\277\260.md"
@@ -7,8 +7,8 @@
**传统推荐系统及深度学习推荐系统的演化关系图(图来自《深度学习推荐系统》)**
传统推荐系统(左),深度学习推荐系统(右)
@@ -119,7 +119,7 @@
在讲AUC前需要理解混淆矩阵,召回率,精确率,ROC曲线等概念
-
+
TP:真的真了(真实值是真的,预测也是真)
@@ -136,7 +136,7 @@
$$
ROC(**Receiver Operating Characteristic Curve**)曲线:
-
+
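The x-axis of the ROC curve is the false positive rate (FPR), where N is the number of actual negative samples and FP is the number of those N negatives that the classifier predicts as positive.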
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.2 \345\215\217\345\220\214\350\277\207\346\273\244-UserCF.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.2 \345\215\217\345\220\214\350\277\207\346\273\244-UserCF.md"
index 2a34353e3..72a3c8617 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.2 \345\215\217\345\220\214\350\277\207\346\273\244-UserCF.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.2 \345\215\217\345\220\214\350\277\207\346\273\244-UserCF.md"
@@ -99,13 +99,13 @@
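+ For example, to recommend items to user $A$, we can first find other users whose interests are similar to theirs.
+ Then we recommend to $A$ the items that those similar users like but that $A$ has not yet interacted with.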
-
+
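## Computation process
Taking the table below as an example, recommending items to a user can be viewed as the task of predicting the user's ratings for items. The table holds the ratings that 5 users gave to 5 items, which can be read as how much each user likes each item.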
-![image-20210629232622758](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210629232622758.png)
+![image-20210629232622758](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210629232622758.png)
The UserCF algorithm has two steps:
@@ -164,7 +164,7 @@ UserCF算法的两个步骤:
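+ Use sklearn to compute the Pearson correlation coefficient between every pair of users. The result shows that the users most similar to Alice are user 1 and user 2.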
-
+
2. **Compute Alice's final score for item 5 from the similar users**
User 1 rated item 5 as 3 and user 2 rated item 5 as 5, so by the formula above, Alice's final score for item 5 is
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.3 \345\215\217\345\220\214\350\277\207\346\273\244-ItemCF.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.3 \345\215\217\345\220\214\350\277\207\346\273\244-ItemCF.md"
index 14c97119d..97ff8df44 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.3 \345\215\217\345\220\214\350\277\207\346\273\244-ItemCF.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/1.1.3 \345\215\217\345\220\214\350\277\207\346\273\244-ItemCF.md"
@@ -9,13 +9,13 @@
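For example, if user 1 likes item A, and items A and C are very similar, then item C can be recommended to user 1. ItemCF does not use the content attributes of items to compute item similarity; it computes similarity mainly by analyzing users' behavior records. The algorithm considers items A and C highly similar because users who like item A are very likely to also like item C.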
-![图片](http://ryluo.oss-cn-chengdu.aliyuncs.com/JavagdvaYX0HSW4PdssV.png!thumbnail)
+![图片](https://ryluo.oss-cn-chengdu.aliyuncs.com/JavagdvaYX0HSW4PdssV.png!thumbnail)
## Computation process
Item-based collaborative filtering is very similar to user-based collaborative filtering, so we reuse the Alice example from above.
-![图片](http://ryluo.oss-cn-chengdu.aliyuncs.com/JavaE306yXB4mGmjIxbn.png!thumbnail)
+![图片](https://ryluo.oss-cn-chengdu.aliyuncs.com/JavaE306yXB4mGmjIxbn.png!thumbnail)
To estimate the rating Alice would give item 5, item-based collaborative filtering proceeds as follows:
@@ -41,7 +41,7 @@
2. Use `sklearn` to compute the Pearson correlation coefficients between items:
-
+
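3. According to the Pearson correlation coefficients, the 2 items most similar to item 5 are item1 and item4. The final score is then computed with the formula above: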
@@ -196,7 +196,7 @@ $$
Consider the following example:
-![图片](http://ryluo.oss-cn-chengdu.aliyuncs.com/JavaxxhHm3BAtMfsy2AV.png!thumbnail)
+![图片](https://ryluo.oss-cn-chengdu.aliyuncs.com/JavaxxhHm3BAtMfsy2AV.png!thumbnail)
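+ In the matrix on the left, $A, B, C, D$ denote items.
+ $D$ is clearly a popular item, and its similarity to $A$, $B$, and $C$ is relatively high. The recommender system is therefore more likely to recommend $D$ to users who have used $A$, $B$, or $C$.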
@@ -242,7 +242,7 @@ $$
>
> As an example, see the figure below (`X,Y,Z` denote items and `d,e,f` denote users):
>
-> ![图片](http://ryluo.oss-cn-chengdu.aliyuncs.com/JavaWKvITKBhYOkfXrzs.png!thumbnail)
+> ![图片](https://ryluo.oss-cn-chengdu.aliyuncs.com/JavaWKvITKBhYOkfXrzs.png!thumbnail)
>
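> + With cosine similarity, users d and e come out as the more similar pair. In reality, users d and f should be more similar; the cosine similarity between d and e is higher only because d tends to give high ratings while e tends to give low ratings.
> + In this situation, consider using the Pearson correlation coefficient to measure similarity between users.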
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/readme.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/readme.md"
index 4a7245455..8deeac97c 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/readme.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.1 \345\237\272\347\241\200\346\216\250\350\215\220\347\256\227\346\263\225/readme.md"
@@ -19,11 +19,11 @@
Traditional recommender systems:
-![](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20200923143443499.png)
+![](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20200923143443499.png)
Deep learning recommender systems:
-![](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20200923143559968.png)
+![](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20200923143559968.png)
**The goal of this open-source material is to master the following algorithms:**
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.1 NeuralCF.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.1 NeuralCF.md"
index 1a110dc75..1658781ea 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.1 NeuralCF.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.1 NeuralCF.md"
@@ -86,11 +86,11 @@ def NCF(dnn_feature_columns):
To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass.
-![image-20210307191533086](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210307191533086.png)
+![image-20210307191533086](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20210307191533086.png)
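Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.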
-![NCF](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片NCF.png)
+![NCF](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片NCF.png)
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.10 DIEN.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.10 DIEN.md"
index cc24dde81..d2e865ef6 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.10 DIEN.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.10 DIEN.md"
@@ -8,7 +8,7 @@ DIN模型考虑了用户兴趣,并且强调用户兴趣是多样的,该模
## 2. How the DIEN model works
-
+
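The model's input can be split into two main parts. One part is the user's behavior sequence, which the interest extractor layer and the interest evolving layer convert into an embedding related to the user's current interest. The other part consists of all features other than user behavior, such as Target id, Context Feature, and UserProfile Feature. These features are each converted to embeddings and concatenated into one large embedding that serves as the non-behavior features (some non-id features may also be present here, and they can presumably be concatenated directly). The DNN input therefore consists of the behavior sequence embedding and the non-behavior feature embedding (the large vector formed by concatenating multiple features); the two are concatenated and fed into the DNN.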
@@ -26,11 +26,11 @@ DIN模型考虑了用户兴趣,并且强调用户兴趣是多样的,该模
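The first thing to pin down is which two quantities the auxiliary loss compares. It is computed between the user's interest representation at each time step (the sequence of hidden states output by the GRU) and the representation of the item the user actually clicks next (taken from the input embedding sequence). In other words, it is the loss between the (t+1)-th item in the behavior sequence and the user's interest representation at time t. **(Why pair the interest at time t with the actual click at time t+1? My understanding is that only by knowing the item the user actually clicks at time t+1 can we better determine the user's interest at time t.)**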
-
+
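Of course, computing the loss only between the clicked item and the interest just before the click covers only positive samples. At time t there are also many items the user did not click, and these are the negative samples; they are typically obtained by sampling from items other than the one clicked at that step. The auxiliary loss thus covers the user's interest at each time step together with the positive and negative items for that step. The final loss function is shown below.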
-
+
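Here $h_t^i$ is the hidden state of user $i$ at time $t$, which can be viewed as the user's interest vector at time $t$; $e_b^i$ and $\hat{e_b^i}$ are the positive and negative samples respectively; and $e_b^i[t+1]$ is the vector of the item that user $i$ clicks at time $t+1$.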
@@ -61,7 +61,7 @@ $$
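A user's interests are diverse, but each interest has its own evolution process. Even when interests drift, we can consider only the evolution of the interest related to the target item (an ad or a product), so the diversity of the user's interests no longer needs to be handled. To extract only the information related to the target item, the authors use the same approach as DIN and compute the similarity between the user's historical interests and the target item. That is, the local activation unit introduced in DIN is used here as well (the Attention module in the figure below).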
-
+
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.2 DeepCrossing.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.2 DeepCrossing.md"
index 0583e6b57..db001c511 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.2 DeepCrossing.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.2 DeepCrossing.md"
@@ -18,7 +18,7 @@ DeepCrossing模型应用场景是微软搜索引擎Bing中的搜索广告推荐
DeepCrossing uses a dedicated neural network layer for each of the problems above. The model structure is as follows:
-
+
The role of each layer is described below:
@@ -48,7 +48,7 @@ dnn_inputs = Concatenate(axis=1)([dense_dnn_inputs, sparse_dnn_inputs]) # B x (n
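The main structure of this layer is an MLP, but DeepCrossing connects the layers with residual connections. Stacking multiple residual layers lets the dimensions of the feature vector be thoroughly crossed and combined, so the model captures more nonlinear and combined feature information and gains expressive power. The residual unit is shown in the figure below: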
-
+
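The Deep Crossing model uses a slightly modified residual unit: instead of convolution kernels it uses a two-layer neural network. As the figure shows, the residual unit applies two layers of ReLU transformation and then adds the original input features back. The code is as follows: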
@@ -136,11 +136,11 @@ def DeepCrossing(dnn_feature_columns):
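To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)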
-
+
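Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.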
-![DeepCrossing](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片DeepCrossing.png)
+![DeepCrossing](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片DeepCrossing.png)
## 5. References
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.3 PNN.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.3 PNN.md"
index 7b9bb4ccb..c513ee8d0 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.3 PNN.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.3 PNN.md"
@@ -12,13 +12,13 @@ PNN模型其实是对IPNN和OPNN的总称,两者分别对应的是不同的Pro
The overall architecture of the PNN model is shown below:
-
+
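The model has five layers. Apart from the Product Layer, the layers use fairly standard techniques that are covered in earlier chapters. The most important part of the model is the Product layer, which crosses and combines the embedding features; it is the part outlined in red in the figure above.
The Product layer consists of a linear part and a nonlinear part, denoted $l_z$ and $l_p$ respectively: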
-
+
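1. Linear module: first-order features (no explicit feature crossing), corresponding to $l_z=(l_z^1,l_z^2, ..., l_z^{D_1})$ in the paper
2. Nonlinear module: higher-order features (with explicit feature crossing), corresponding to $l_p=(l_p^1,l_p^2, ..., l_p^{D_1})$ in the paper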
@@ -236,7 +236,7 @@ class ProductLayer(Layer):
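Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the categorical features is shown. The plotting code is also available in the GitHub repository.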
-![PNN](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片PNN.png)
+![PNN](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片PNN.png)
## 4. Questions for thought
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.4 Wide&Deep.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.4 Wide&Deep.md"
index 4c5e7d8f3..1078d72e9 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.4 Wide&Deep.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.4 Wide&Deep.md"
@@ -14,7 +14,7 @@ Wide&Deep模型就是围绕记忆性和泛化性进行讨论的,模型能够
## 2. Model structure and principles
-
+
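The structure of the Wide&Deep model itself is very simple and is easy to follow for anyone with some background in machine learning and deep learning. Deciding which features belong in the Wide part and which in the Deep part for your own scenario, however, requires understanding what the paper's authors intended when they designed each part of the model. That understanding is a prerequisite for using the model well.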
@@ -92,11 +92,11 @@ def WideNDeep(linear_feature_columns, dnn_feature_columns):
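We do not go into the details of each block here; the GitHub code we provide is commented in detail and should be easy to follow. To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)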
-
+
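Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.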
-![Wide&Deep](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片Wide&Deep.png)
+![Wide&Deep](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片Wide&Deep.png)
## 4. Reflection
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.5 DeepFM.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.5 DeepFM.md"
index 78e62ceb8..f4f569c95 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.5 DeepFM.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.5 DeepFM.md"
@@ -9,15 +9,15 @@
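- ==Limitations of DNN==
When a DNN is used for recommendation, the number of network parameters becomes excessively large. This is because discrete features must be one-hot encoded during feature processing, which makes the input dimension explode. The figure below is borrowed from an AI conference talk: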
-
+
Such a large number of parameters is impractical. To address this limitation, we can apply the classic Field idea and convert the one-hot features into dense vectors:
-
+
Adding fully connected layers on top then yields high-order feature combinations, as shown below:
-
+
Low-order feature combinations are still missing, however, so FM is added to represent them.
@@ -25,7 +25,7 @@
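FM and DNN can be combined in two ways: in parallel or in series. Each approach has several representative models. FNN came before DeepFM; although it may be less influential than DeepFM, understanding the idea behind FNN helps in appreciating the characteristics and strengths of DeepFM.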
-
+
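FNN uses a pre-trained FM module to obtain latent vectors and feeds them into the DNN. Further experiments showed that adding a product layer between the embedding layer and hidden layer 1 (as in the figure above) improves model performance, which led to PNN, where a product layer replaces the pre-trained FM layer.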
@@ -33,7 +33,7 @@ FNN是使用预训练好的FM模块,得到隐向量,然后把隐向量作为
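FNN and PNN still have a fairly obvious unresolved weakness: they learn relatively few low-order feature combinations. This is mainly caused by connecting FM and DNN in series: although FM learns low-order feature combinations, the fully connected structure of the DNN prevents those low-order features from being well represented at the DNN's output. Having identified the problem, changing the serial structure to a parallel one addresses it fairly well. Google therefore proposed the Wide&Deep model (covered in an earlier chapter). On closer inspection, however, although Wide&Deep makes the overall structure parallel, in practice the Wide module requires fairly elaborate feature engineering; in other words, manual processing has a considerable effect on model quality (this can be verified in the Wide&Deep section).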
-
+
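As the figure above shows, this model still has a problem: **combining low-order and high-order features directly at the output units stage can easily bias the model toward learning either low-order or high-order features, rather than combining the two well.**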
@@ -41,7 +41,7 @@ FNN和PNN模型仍然有一个比较明显的尚未解决的缺点:对于低
## 2. Model structure and principles
-
+
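The Field and Embedding processing is the same as in the earlier methods (the green part of the figure above). DeepFM replaces the Wide part with an FM layer (the blue part of the figure above).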
@@ -57,13 +57,13 @@ FNN和PNN模型仍然有一个比较明显的尚未解决的缺点:对于低
$$
\hat{y}_{FM}(x) = w_0+\sum_{i=1}^N w_ix_i + \sum_{i=1}^N \sum_{j=i+1}^N v_i^T v_j x_ix_j
$$
-
+
### 2.2 Deep
Deep architecture diagram
-
+
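The Deep module learns high-order feature combinations. In the figure above, the dense embeddings are fed into the hidden layers through full connections. The dense embeddings are there to solve the parameter explosion problem in DNNs, which is a common technique in recommendation models.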
@@ -132,11 +132,11 @@ def DeepFM(linear_feature_columns, dnn_feature_columns):
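We do not go into the details of each block here; the GitHub code we provide is commented in detail and should be easy to follow. To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)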
-
+
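Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.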
-![DeepFM](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片DeepFM.png)
+![DeepFM](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片DeepFM.png)
@@ -146,7 +146,7 @@ def DeepFM(linear_feature_columns, dnn_feature_columns):
2. In the figure below, what do you think the differently colored nodes in Sparse Feature represent?
-
+
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.6 NFM.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.6 NFM.md"
index c569925de..fdd26f9a9 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.6 NFM.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.6 NFM.md"
@@ -10,11 +10,11 @@ $$
$$
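Comparing with FM, only the third term changes; the first two terms are the same. As noted, one problem with FM is that it only models second-order interactions and is a linear model, which is an inherent limitation. Overcoming it requires changing the formula itself, so the authors' idea is to **replace the second-order latent-vector inner product in FM with a more expressive function**.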
-
+
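A neural network is a natural choice for this more expressive function, since in theory a neural network can approximate functions of arbitrary complexity. The authors indeed replace $f(x)$ with a neural network. It is not a plain DNN, though: the lower layers still model feature interactions and a DNN sits on top. This is the final NFM network: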
-
+
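Readers who have seen PNN will notice that this structure closely resembles it, except that PNN's product_layer is replaced here by Bi-Interaction Pooling, which is the core component of NFM. Note that the figure omits the first-order part and visualizes only $f(x)$. Below we work through the network from the bottom up.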
@@ -137,11 +137,11 @@ def NFM(linear_feature_columns, dnn_feature_columns):
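With the explanation above, the model should be easy to understand at a high level. We do not go into the details of each block here; the GitHub code we provide is commented in detail and should be easy to follow. To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)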
-
+
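Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.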
-![nfm](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片nfm.png)
+![nfm](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片nfm.png)
## 4. Questions for thought
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.7 DCN.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.7 DCN.md"
index 023824d7c..483b4fa76 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.7 DCN.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.7 DCN.md"
@@ -8,7 +8,7 @@ Wide&Deep模型的提出不仅综合了“记忆能力”和“泛化能力”
The structure of the model is as follows:
-
+
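The structure is fairly simple. From bottom to top: the Embedding and Stacking layer, the Cross network and the Deep network in parallel, and finally the output layer. Each is examined in turn below.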
@@ -35,7 +35,7 @@ $$
$$
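The second-order part of the cross layer closely resembles the outer product operation described for PNN, with the addition of a weight vector $w_l$ for the outer product, plus the layer's input vector $x_l$ and a bias vector $b_l$. The cross layer is visualized below: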
-
+
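Each layer adds an $n$-dimensional weight vector $w_l$ (where $n$ is the dimension of the input vector) and retains its input vector, so the change between a layer's input and output is not especially large. The original paper gives a detailed derivation of why the Cross Network works, but it is fairly involved; here I use one expression to briefly illustrate what makes the formula above so effective: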
@@ -139,7 +139,7 @@ def DCN(linear_feature_columns, dnn_feature_columns):
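Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the categorical features is shown. The plotting code is also available in the GitHub repository.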
-![DCN](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片DCN.png)
+![DCN](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片DCN.png)
## 4. Reflection
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.8 AFM.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.8 AFM.md"
index 59c749d9f..c1999ee6e 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.8 AFM.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.8 AFM.md"
@@ -10,7 +10,7 @@ $$
## 2. How the AFM model works
-
+
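The figure above shows the structure of the feature-interaction part of AFM (the non-interaction part is the same as in FM and is not shown). The two core components of AFM are the Pair-wise Interaction Layer and Attention-based Pooling. The former computes the element-wise product (Hadamard product: corresponding elements of two vectors are multiplied, yielding another vector) of every pair of latent vectors of the non-zero input features. If the input has m non-zero features, the Pair-wise Interaction Layer outputs $\frac{m(m-1)}{2}$ vectors. These interaction vectors are then fed into Attention-based Pooling, which first computes an adaptive weight for each feature pair (via the Attention Net) and then compresses the set of vectors into a single vector by weighted sum. Because the final output must be a scalar, this vector is projected to a single value by another vector, giving the attention-weighted second-order interaction output. (If this is unclear, read the descriptions of the two core layers below first.)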
@@ -109,11 +109,11 @@ def AFM(linear_feature_columns, dnn_feature_columns):
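We do not go into the details of each block here; the GitHub code we provide is commented in detail and should be easy to follow. To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)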
-
+
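Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.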
-![AFM](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片AFM.png)
+![AFM](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片AFM.png)
## 4. Reflection
diff --git "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.9 DIN.md" "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.9 DIN.md"
index 34cf89f8c..083521174 100644
--- "a/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.9 DIN.md"
+++ "b/docs/\347\254\254\344\270\200\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\237\272\347\241\200/1.2 \346\267\261\345\272\246\346\216\250\350\215\220\346\250\241\345\236\213/1.2.9 DIN.md"
@@ -159,11 +159,11 @@ def DIN(feature_columns, behavior_feature_list, behavior_seq_feature_list):
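We do not go into the details of each block here; the GitHub code we provide is commented in detail and should be easy to follow. To aid reading, we have also drawn a diagram of the overall model architecture to help you understand each block and the forward pass. (The hand-drawn diagram is not very polished; please bear with it for now, and we will tidy up all of these diagrams later.)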
-
+
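Below is the model structure plotted with Keras. To keep the figure readable, only a small subset of the numerical and categorical features is shown. The plotting code is also available in the GitHub repository.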
-![din](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片din.png)
+![din](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片din.png)
## Reflection
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
index 1dae6308d..1567babe2 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.ipynb"
@@ -1,664 +1,664 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
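-    "# Understanding the competition task\n",
-    "Understanding the task is the foundation for tackling any competition. It shapes later work such as feature engineering and model building, and sets the direction of everything that follows. A correct grasp of the ideas behind the task and a clear view of its business logic help you build more effective features and models in less time. In every competition, understanding the task is an essential first step that must be done well. Here we start from the task itself: first we look at the overview and the data and work out a rough approach, then we look at the evaluation metric, and finally we summarize some lessons about understanding the task.\n",
-    "\n",
-    "## Task overview\n",
-    "This competition is a user behavior prediction challenge set against the background of news recommendation in a news app. The goal is to **predict users' future click behavior, namely the last news article each user clicks, from the users' historical browsing and click data**. The task is designed to introduce participants to some of the business background of recommender systems and to solving practical problems.  \n",
-    "\n",
-    "## Data overview\n",
-    "The data comes from user interaction logs of a news app platform: 300,000 users, nearly 3 million clicks, and more than 360,000 distinct news articles, each with a corresponding embedding vector. To keep the competition fair, the click logs of 200,000 users are used as the training set, those of 50,000 users as test set A, and those of 50,000 users as test set B. See the competition description for the specific tables and fields. Below we discuss how to make sense of such data so that the next steps can proceed effectively.  \n",
-    "## Understanding the evaluation metric\n",
-    "To understand the evaluation, look at it together with the final submission file. According to sample.submit.csv, the submission gives five recommended articles for each user, ordered from highest to lowest click probability. Each user's actual last click is a single article, so what matters is whether any of our five recommendations hits it. For user1, for example, the submission would be:\n",
-    ">user1, article1, article2, article3, article4, article5.\n",
-    "\n",
-    "The evaluation metric is:\n",
-    "$$\n",
-    "score(user) = \\sum_{k=1}^5 \\frac{s(user, k)}{k}\n",
-    "$$\n",
-    "\n",
-    "If article1 is the article the user actually clicked, i.e. article1 is a hit, then s(user1,1)=1 and s(user1,2) through s(user1,5) are all 0, so score(user1)=1. If article2 is the clicked article, then s(user1,2)=1 and s(user1,k)=0 for k=1,3,4,5, so score(user1)=1/2. In other words, score(user) is the reciprocal of the rank of the hit. If none of the five is a hit, score(user1)=0. This is reasonable: we want the hit to rank as high as possible, and that is exactly when the score is higher.\n",
-    "\n",
-    "## Understanding the task\n",
-    "From the overview, the goal is clear: predict the last news article each user clicks from the user's historical click data. Seen this way, this competition differs from the ordinary structured-data competitions we have met before in two main ways:\n",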
- " \n",
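-    "- First, the target: we predict the last news article clicked, so what we recommend to the user is a news article. This is unlike earlier problems that predict a number or the class a sample belongs to\n",
-    "- Second, the data: the data provided is not in the familiar features + label form; it is click logs taken from a real business scenario\n",
-    "\n",
-    "So our line of thinking is to **turn this prediction problem into a supervised learning problem (features + label), so that ML or DL models can be built and used for prediction**. Several questions naturally follow: How do we turn it into a supervised learning problem? What kind of supervised learning problem? Which features can we use? Which models can we try? With hundreds of thousands of candidate articles, what strategies do we have?  \n",
-    "\n",
-    "These questions will not be answered the moment we read the task, but once they are posed we can work toward solutions. Take the second question: what kind of supervised learning problem? Since we predict the last article a user clicks, picking one out of 360,000 articles, we might first think of multi-class classification (choose 1 of 360,000 classes). A classification problem that large is hard to handle, so can we transform it? If we can predict the probability that a given user's last click lands on a given article, we have solved the problem indirectly: the article with the highest probability is the one the user is most likely to click last. The original problem thus becomes a click-through-rate prediction problem, (user, article) --> click probability (soft classification), which is a familiar supervised classification problem. This gives a rough direction for model selection, starting with the simplest option, logistic regression.  \n",
-    "We now have a rough approach to the task: cast it as a classification problem in which the label is whether a user clicks an article and the features cover both the user and the article, then train a classifier to predict the probability that a user's last click is a given article. This raises further questions: How do we construct the supervised learning problem? How do we build the training and test sets? Which features can we use? Which models can we try? With 360,000 articles and more than 200,000 users, what strategies can reduce the scale of the problem? How do we make the final prediction? "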
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Baseline"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Imports"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:46:49.678700Z",
- "start_time": "2020-11-16T07:46:49.673336Z"
- }
- },
- "outputs": [],
- "source": [
- "# import packages\n",
- "import time, math, os\n",
- "from tqdm import tqdm\n",
- "import gc\n",
- "import pickle\n",
- "import random\n",
- "from datetime import datetime\n",
- "from operator import itemgetter\n",
- "import numpy as np\n",
- "import pandas as pd\n",
- "import warnings\n",
- "from collections import defaultdict\n",
- "import collections\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:48:34.240098Z",
- "start_time": "2020-11-16T07:48:34.236370Z"
- }
- },
- "outputs": [],
- "source": [
-    "# data_path = './data_raw/'\n",
-    "data_path = '/home/admin/jupyter/data/' # path on the Tianchi platform\n",
-    "save_path = '/home/admin/jupyter/temp_results/' # path on the Tianchi platform"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Memory-saving function for DataFrames"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# Standard helper that reduces a DataFrame's memory usage by downcasting numeric columns\n",
- "def reduce_mem(df):\n",
- " starttime = time.time()\n",
- " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
- " start_mem = df.memory_usage().sum() / 1024**2\n",
- " for col in df.columns:\n",
- " col_type = df[col].dtypes\n",
- " if col_type in numerics:\n",
- " c_min = df[col].min()\n",
- " c_max = df[col].max()\n",
- " if pd.isnull(c_min) or pd.isnull(c_max):\n",
- " continue\n",
- " if str(col_type)[:3] == 'int':\n",
- " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
- " df[col] = df[col].astype(np.int8)\n",
- " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
- " df[col] = df[col].astype(np.int16)\n",
- " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
- " df[col] = df[col].astype(np.int32)\n",
- " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
- " df[col] = df[col].astype(np.int64)\n",
- " else:\n",
- " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
- " df[col] = df[col].astype(np.float16)\n",
- " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
- " df[col] = df[col].astype(np.float32)\n",
- " else:\n",
- " df[col] = df[col].astype(np.float64)\n",
- " end_mem = df.memory_usage().sum() / 1024**2\n",
- " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
- " 100*(start_mem-end_mem)/start_mem,\n",
- " (time.time()-starttime)/60))\n",
- " return df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Load sampled or full data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:48:50.619963Z",
- "start_time": "2020-11-16T07:48:50.611667Z"
- }
- },
- "outputs": [],
- "source": [
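-    "# debug mode: take a subset of the training set for debugging the code\n",
-    "def get_all_click_sample(data_path, sample_nums=10000):\n",
-    "    \"\"\"\n",
-    "        Sample part of the training set for debugging\n",
-    "        data_path: path where the raw data is stored\n",
-    "        sample_nums: number of users to sample (sampling by user keeps memory usage manageable)\n",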
- " \"\"\"\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " all_user_ids = all_click.user_id.unique()\n",
- "\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
- " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click\n",
- "\n",
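-    "# Load the click data in online or offline mode. To produce an online submission, merge the test-set clicks into the full data\n",
-    "# To validate a model or a feature offline, the training set alone is enough\n",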
- "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
- " if offline:\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " else:\n",
- " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "\n",
-    "        all_click = pd.concat([trn_click, tst_click])  # DataFrame.append was removed in pandas 2.0\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {},
- "outputs": [],
- "source": [
-    "# Full data: training set plus test set A, since offline=False\n",
- "all_click_df = get_all_click_df(data_path, offline=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Build the user - article - click time dictionary"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:56:39.800240Z",
- "start_time": "2020-11-16T07:56:39.793541Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build each user's clicked-article sequence ordered by click time: {user1: [(item1, time1), (item2, time2)..]...}\n",
- "def get_user_item_time(click_df):\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " \n",
- " def make_item_time_pair(df):\n",
- " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
- " \n",
-    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'item_time_list'})\n",
- " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
- " \n",
- " return user_item_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Get the top-k most-clicked articles"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [],
- "source": [
- "# 获取近期点击最多的文章\n",
- "def get_item_topk_click(click_df, k):\n",
- " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
- " return topk_click"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## itemcf的物品相似度计算"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:51:07.577037Z",
- "start_time": "2020-11-16T07:51:07.568098Z"
- }
- },
- "outputs": [],
- "source": [
- "def itemcf_sim(df):\n",
- " \"\"\"\n",
- " 文章与文章之间的相似性矩阵计算\n",
- " :param df: 数据表\n",
- " :item_created_time_dict: 文章创建时间的字典\n",
- " return : 文章与文章的相似性矩阵\n",
- " 思路: 基于物品的协同过滤(详细请参考上一期推荐系统基础的组队学习), 在多路召回部分会加上关联规则的召回策略\n",
- " \"\"\"\n",
- " \n",
- " user_item_time_dict = get_user_item_time(df)\n",
- " \n",
- " # 计算物品相似度\n",
- " i2i_sim = {}\n",
- " item_cnt = defaultdict(int)\n",
- " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
- " # 在基于商品的协同过滤优化的时候可以考虑时间因素\n",
- " for i, i_click_time in item_time_list:\n",
- " item_cnt[i] += 1\n",
- " i2i_sim.setdefault(i, {})\n",
- " for j, j_click_time in item_time_list:\n",
- " if(i == j):\n",
- " continue\n",
- " i2i_sim[i].setdefault(j, 0)\n",
- " \n",
- " i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)\n",
- " \n",
- " i2i_sim_ = i2i_sim.copy()\n",
- " for i, related_items in i2i_sim.items():\n",
- " for j, wij in related_items.items():\n",
- " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
- " \n",
- " # 将得到的相似性矩阵保存到本地\n",
- " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
- " \n",
- " return i2i_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:53:10.038470Z",
- "start_time": "2020-11-16T07:51:11.281176Z"
- }
- },
- "outputs": [
+ "cells": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:23<00:00, 10802.38it/s]\n"
- ]
- }
- ],
- "source": [
- "i2i_sim = itemcf_sim(all_click_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## itemcf 的文章推荐"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T08:03:18.383215Z",
- "start_time": "2020-11-16T08:03:18.373432Z"
- }
- },
- "outputs": [],
- "source": [
- "# 基于商品的召回i2i\n",
- "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):\n",
- " \"\"\"\n",
- " 基于文章协同过滤的召回\n",
- " :param user_id: 用户id\n",
- " :param user_item_time_dict: 字典, 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...}\n",
- " :param i2i_sim: 字典,文章相似性矩阵\n",
- " :param sim_item_topk: 整数, 选择与当前文章最相似的前k篇文章\n",
- " :param recall_item_num: 整数, 最后的召回文章数量\n",
- " :param item_topk_click: 列表,点击次数最多的文章列表,用户召回补全 \n",
- " return: 召回的文章列表 {item1:score1, item2: score2...}\n",
- " 注意: 基于物品的协同过滤(详细请参考上一期推荐系统基础的组队学习), 在多路召回部分会加上关联规则的召回策略\n",
- " \"\"\"\n",
- " \n",
- " # 获取用户历史交互的文章\n",
- " user_hist_items = user_item_time_dict[user_id]\n",
- " user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
- " \n",
- " item_rank = {}\n",
- " for loc, (i, click_time) in enumerate(user_hist_items):\n",
- " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
- " if j in user_hist_items_:\n",
- " continue\n",
- " \n",
- " item_rank.setdefault(j, 0)\n",
- " item_rank[j] += wij\n",
- " \n",
- " # 不足10个,用热门商品补全\n",
- " if len(item_rank) < recall_item_num:\n",
- " for i, item in enumerate(item_topk_click):\n",
- " if item in item_rank: # 填充的item应该不在原来的列表中\n",
- " continue\n",
- " item_rank[item] = - i - 100 # 随便给个负数就行\n",
- " if len(item_rank) == recall_item_num:\n",
- " break\n",
- " \n",
- " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
- " \n",
- " return item_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 给每个用户根据物品的协同过滤推荐文章"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:15:01.109798Z",
- "start_time": "2020-11-16T08:11:07.233787Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
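+ "# Understanding the Competition Task\n",
+ "Understanding the task is the foundation for approaching any competition problem. It shapes the later feature engineering and model building and sets the direction of all subsequent work. A correct grasp of the idea behind the task and of its business logic helps build more effective features and models in less time. In any competition, understanding the task is an essential first step. In this section we start from there: we first look at the overview of the task and its data and derive a rough approach, then examine the evaluation metric, and finally summarize some lessons on understanding a task.\n",
+ "\n",
+ "## Task Overview\n",
+ "This competition is a user behavior prediction challenge in a news recommendation setting. Against the background of news recommendation in a news app, **we are asked to predict a user's future click behavior, namely the last news article the user clicks, from the user's historical browsing and click data**. The task is designed to introduce some of the business background of recommender systems and to solve a practical problem. \n",
+ "\n",
+ "## Data Overview\n",
+ "The data comes from user interaction logs of a news app platform and covers 300,000 users, nearly 3 million clicks, and more than 360,000 distinct news articles; each article also has a corresponding embedding vector. To keep the competition fair, the click logs of 200,000 users are used as the training set, those of 50,000 users as test set A, and those of another 50,000 users as test set B. For the specific tables and fields, please refer to the task description. Below we discuss how to make sense of data like this so that the next steps can proceed effectively. \n",
+ "## Understanding the Evaluation Metric\n",
+ "To understand the evaluation metric, we need to look at it together with the final submission file. According to sample.submit.csv, for each user we submit five recommended articles, ordered from highest to lowest click probability. Each user's actual last click has exactly one true answer, so we check whether any of the five recommended articles hits it. For example, for user1 our submission would be:\n",
+ ">user1, article1, article2, article3, article4, article5.\n",
+ "\n",
+ "The evaluation metric is defined as:\n",
+ "$$\n",
+ "score(user) = \\sum_{k=1}^5 \\frac{s(user, k)}{k}\n",
+ "$$\n",
+ "\n",
+ "Here s(user, k) is 1 if the k-th recommended article is the one the user actually clicked, and 0 otherwise. If article1 is the article the user actually clicked, that is, article1 is a hit, then s(user1,1)=1 and s(user1,2) through s(user1,5) are all 0, so score(user1)=1. If article2 is the clicked article, then s(user1,2)=1 and s(user1,k) is 0 for k=1,3,4,5, so score(user1)=1/2. In other words, score(user) is the reciprocal of the position of the hit. If none of the five hits, score(user1)=0. This is reasonable: we want the hit to rank as early as possible, and the score is higher exactly when it does.\n",
+ "\n",
+ "## Analyzing the Task\n",
+ "From the task overview, we first need to be clear about the goal of this competition: predict the last news article a user clicks from the user's historical browsing and click data. Seen from this goal, this competition differs from the ordinary structured-data competitions we have met before in two main ways:\n",
+ "\n",
+ "- First, the target: we need to predict the last clicked news article, that is, what we recommend to the user is a news article, rather than predicting a number or a class label as in earlier problems\n",
+ "- Second, the data: the given data is not the usual features-plus-label data but user click logs taken from a real business scenario\n",
+ "\n",
+ "So when we get this task, the direction of thinking is to combine it with our goal and **turn the prediction problem into a supervised learning problem (features + label), after which we can model and predict with ML, DL, and so on**. Several questions then naturally come to mind: How do we turn it into a supervised learning problem? What kind of supervised learning problem? Which features can we use? Which models can we try? Facing recommendation over hundreds of thousands of articles, what strategies do we have? \n",
+ "\n",
+ "Of course, the answers to these questions do not appear the moment we read the task, but once we have the questions we can work on solving them. Take the second question: what kind of supervised learning problem? Since we predict the last news article a user clicks, picking one article out of 360,000 might first suggest a multi-class classification problem (choose 1 of 360,000 classes), but a classification problem that large would be hard to handle. Can we transform it? Since we want the last clicked article, if we can predict the probability that a given user's last click falls on a given article, doesn't that indirectly solve the problem? The article with the highest probability is the one the user is most likely to click last. This turns the original problem into a click-through-rate prediction problem, (user, article) --> click probability (soft classification), which is a familiar supervised classification problem. With this, we have a rough direction for model selection later on, for example the simplest choice, logistic regression. \n",
+ "We should now have a rough outline of a solution: first turn the task into a classification problem in which the label is whether the user clicks a given article and the features describe the user and the article, then train a classification model to predict the probability that a user's last click is a given article. This raises further questions: How do we build the supervised learning problem? How do we construct the training and test sets? Which features can we use? Which models can we try? Facing recommendation of 360,000 articles to more than 200,000 users, what strategies can reduce the scale of the problem? How do we make the final prediction? "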
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [43:19<00:00, 96.18it/s] \n"
- ]
- }
- ],
- "source": [
- "# 定义\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "\n",
- "# 获取 用户 - 文章 - 点击时间的字典\n",
- "user_item_time_dict = get_user_item_time(all_click_df)\n",
- "\n",
- "# 去取文章相似度\n",
- "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
- "\n",
- "# 相似文章的数量\n",
- "sim_item_topk = 10\n",
- "\n",
- "# 召回文章数量\n",
- "recall_item_num = 10\n",
- "\n",
- "# 用户热度补全\n",
- "item_topk_click = get_item_topk_click(all_click_df, k=50)\n",
- "\n",
- "for user in tqdm(all_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, \n",
- " sim_item_topk, recall_item_num, item_topk_click)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 召回字典转换成df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:16:36.647466Z",
- "start_time": "2020-11-16T10:16:24.791219Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Baseline"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:04<00:00, 53319.08it/s]\n"
- ]
- }
- ],
- "source": [
- "# 将字典的形式转换成df\n",
- "user_item_score_list = []\n",
- "\n",
- "for user, items in tqdm(user_recall_items_dict.items()):\n",
- " for item, score in items:\n",
- " user_item_score_list.append([user, item, score])\n",
- "\n",
- "recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 生成提交文件"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:16:46.268341Z",
- "start_time": "2020-11-16T10:16:46.259293Z"
- }
- },
- "outputs": [],
- "source": [
- "# 生成提交文件\n",
- "def submit(recall_df, topk=5, model_name=None):\n",
- " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
- " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 判断是不是每个用户都有5篇文章及以上\n",
- " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
- " assert tmp.min() >= topk\n",
- " \n",
- " del recall_df['pred_score']\n",
- " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
- " \n",
- " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
- " # 按照提交格式定义列名\n",
- " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
- " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
- " \n",
- " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
- " submit.to_csv(save_name, index=False, header=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:17:42.254328Z",
- "start_time": "2020-11-16T10:17:32.211862Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Import packages"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:46:49.678700Z",
+ "start_time": "2020-11-16T07:46:49.673336Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# import packages\n",
+ "import time, math, os\n",
+ "from tqdm import tqdm\n",
+ "import gc\n",
+ "import pickle\n",
+ "import random\n",
+ "from datetime import datetime\n",
+ "from operator import itemgetter\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import warnings\n",
+ "from collections import defaultdict\n",
+ "import collections\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:48:34.240098Z",
+ "start_time": "2020-11-16T07:48:34.236370Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# data_path = './data_raw/'\n",
+ "data_path = '/home/admin/jupyter/data/' # Tianchi platform path\n",
+ "save_path = '/home/admin/jupyter/temp_results/' # Tianchi platform path"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Function for reducing DataFrame memory usage"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# A standard helper for reducing memory usage: downcast numeric columns to the smallest dtype that fits\n",
+ "def reduce_mem(df):\n",
+ "    starttime = time.time()\n",
+ "    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
+ "    start_mem = df.memory_usage().sum() / 1024**2\n",
+ "    for col in df.columns:\n",
+ "        col_type = df[col].dtypes\n",
+ "        if col_type in numerics:\n",
+ "            c_min = df[col].min()\n",
+ "            c_max = df[col].max()\n",
+ "            if pd.isnull(c_min) or pd.isnull(c_max):\n",
+ "                continue\n",
+ "            if str(col_type)[:3] == 'int':\n",
+ "                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
+ "                    df[col] = df[col].astype(np.int8)\n",
+ "                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
+ "                    df[col] = df[col].astype(np.int16)\n",
+ "                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
+ "                    df[col] = df[col].astype(np.int32)\n",
+ "                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
+ "                    df[col] = df[col].astype(np.int64)\n",
+ "            else:\n",
+ "                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
+ "                    df[col] = df[col].astype(np.float16)\n",
+ "                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
+ "                    df[col] = df[col].astype(np.float32)\n",
+ "                else:\n",
+ "                    df[col] = df[col].astype(np.float64)\n",
+ "    end_mem = df.memory_usage().sum() / 1024**2\n",
+ "    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
+ "                                                                                                           100*(start_mem-end_mem)/start_mem,\n",
+ "                                                                                                           (time.time()-starttime)/60))\n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Read sampled or full data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:48:50.619963Z",
+ "start_time": "2020-11-16T07:48:50.611667Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Debug mode: sample part of the training set for debugging the code\n",
+ "def get_all_click_sample(data_path, sample_nums=10000):\n",
+ "    \"\"\"\n",
+ "    Sample part of the training set for debugging\n",
+ "    data_path: path where the raw data is stored\n",
+ "    sample_nums: number of users to sample (users are sampled because of machine memory limits)\n",
+ "    \"\"\"\n",
+ "    all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ "    all_user_ids = all_click.user_id.unique()\n",
+ "\n",
+ "    sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False)\n",
+ "    all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ "\n",
+ "    all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ "    return all_click\n",
+ "\n",
+ "# Read the click data, either online or offline. To produce an online submission, the test-set clicks should be merged into the full data\n",
+ "# To validate models or features offline, the training set alone is enough\n",
+ "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
+ "    if offline:\n",
+ "        all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ "    else:\n",
+ "        trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ "        tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "\n",
+ "        # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result\n",
+ "        all_click = pd.concat([trn_click, tst_click])\n",
+ "\n",
+ "    all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ "    return all_click"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Full data: training-set clicks plus test set A clicks (offline=False)\n",
+ "all_click_df = get_all_click_df(data_path, offline=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Build the user - article - click time dictionary"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:56:39.800240Z",
+ "start_time": "2020-11-16T07:56:39.793541Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Build each user's click sequence ordered by click time: {user1: [(item1, time1), (item2, time2)..]...}\n",
+ "def get_user_item_time(click_df):\n",
+ "\n",
+ "    click_df = click_df.sort_values('click_timestamp')\n",
+ "\n",
+ "    def make_item_time_pair(df):\n",
+ "        return list(zip(df['click_article_id'], df['click_timestamp']))\n",
+ "\n",
+ "    # Select several columns with a list (double brackets); tuple-style selection is no longer supported by pandas\n",
+ "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
+ "                                .reset_index().rename(columns={0: 'item_time_list'})\n",
+ "    user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
+ "\n",
+ "    return user_item_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Get the top-k most clicked articles"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Get the k most clicked articles (counted over all clicks in click_df)\n",
+ "def get_item_topk_click(click_df, k):\n",
+ "    topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
+ "    return topk_click"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Item similarity computation for itemcf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:51:07.577037Z",
+ "start_time": "2020-11-16T07:51:07.568098Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def itemcf_sim(df):\n",
+ "    \"\"\"\n",
+ "    Compute the article-to-article similarity matrix\n",
+ "    :param df: click log DataFrame\n",
+ "    return: article-to-article similarity matrix, stored as a nested dict\n",
+ "    Approach: item-based collaborative filtering (see the previous team-study course on recommender system basics for details); an association-rule recall strategy is added in the multi-channel recall part\n",
+ "    \"\"\"\n",
+ "\n",
+ "    user_item_time_dict = get_user_item_time(df)\n",
+ "\n",
+ "    # Compute item similarity\n",
+ "    i2i_sim = {}\n",
+ "    item_cnt = defaultdict(int)\n",
+ "    for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
+ "        # Time information could be taken into account when refining item-based CF\n",
+ "        for i, i_click_time in item_time_list:\n",
+ "            item_cnt[i] += 1\n",
+ "            i2i_sim.setdefault(i, {})\n",
+ "            for j, j_click_time in item_time_list:\n",
+ "                if i == j:\n",
+ "                    continue\n",
+ "                i2i_sim[i].setdefault(j, 0)\n",
+ "\n",
+ "                # Co-clicks from users with long click histories are weighted down\n",
+ "                i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)\n",
+ "\n",
+ "    # Shallow copy: the inner dicts are shared, so the normalization below is applied in place\n",
+ "    i2i_sim_ = i2i_sim.copy()\n",
+ "    for i, related_items in i2i_sim.items():\n",
+ "        for j, wij in related_items.items():\n",
+ "            i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
+ "\n",
+ "    # Save the similarity matrix locally\n",
+ "    with open(save_path + 'itemcf_i2i_sim.pkl', 'wb') as f:\n",
+ "        pickle.dump(i2i_sim_, f)\n",
+ "\n",
+ "    return i2i_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:53:10.038470Z",
+ "start_time": "2020-11-16T07:51:11.281176Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:23<00:00, 10802.38it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "i2i_sim = itemcf_sim(all_click_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Article recommendation with itemcf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T08:03:18.383215Z",
+ "start_time": "2020-11-16T08:03:18.373432Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Item-based recall (i2i)\n",
+ "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):\n",
+ "    \"\"\"\n",
+ "    Recall based on article collaborative filtering\n",
+ "    :param user_id: user id\n",
+ "    :param user_item_time_dict: dict, each user's click sequence ordered by click time {user1: [(item1, time1), (item2, time2)..]...}\n",
+ "    :param i2i_sim: dict, article similarity matrix\n",
+ "    :param sim_item_topk: int, number of most similar articles to take for each clicked article\n",
+ "    :param recall_item_num: int, number of articles to recall in the end\n",
+ "    :param item_topk_click: list, most clicked articles, used to pad the recall list\n",
+ "    return: recalled articles as a list of (item, score) pairs sorted by score in descending order\n",
+ "    Note: item-based collaborative filtering (see the previous team-study course on recommender system basics for details); an association-rule recall strategy is added in the multi-channel recall part\n",
+ "    \"\"\"\n",
+ "\n",
+ "    # Articles the user has interacted with\n",
+ "    user_hist_items = user_item_time_dict[user_id]\n",
+ "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
+ "\n",
+ "    item_rank = {}\n",
+ "    for loc, (i, click_time) in enumerate(user_hist_items):\n",
+ "        for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
+ "            if j in user_hist_items_:\n",
+ "                continue\n",
+ "\n",
+ "            item_rank.setdefault(j, 0)\n",
+ "            item_rank[j] += wij\n",
+ "\n",
+ "    # Fewer than recall_item_num candidates: pad with popular articles\n",
+ "    if len(item_rank) < recall_item_num:\n",
+ "        for i, item in enumerate(item_topk_click):\n",
+ "            if item in item_rank:  # a padded item must not already be among the candidates\n",
+ "                continue\n",
+ "            item_rank[item] = - i - 100  # any negative score works; it keeps padded items below recalled ones\n",
+ "            if len(item_rank) == recall_item_num:\n",
+ "                break\n",
+ "\n",
+ "    item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
+ "\n",
+ "    return item_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Recommend articles to every user with item-based collaborative filtering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:15:01.109798Z",
+ "start_time": "2020-11-16T08:11:07.233787Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [43:19<00:00, 96.18it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+ "# Recall results for every user\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "\n",
+ "# Build the user - article - click time dictionary\n",
+ "user_item_time_dict = get_user_item_time(all_click_df)\n",
+ "\n",
+ "# Load the article similarity matrix\n",
+ "with open(save_path + 'itemcf_i2i_sim.pkl', 'rb') as f:\n",
+ "    i2i_sim = pickle.load(f)\n",
+ "\n",
+ "# Number of similar articles taken per clicked article\n",
+ "sim_item_topk = 10\n",
+ "\n",
+ "# Number of articles to recall per user\n",
+ "recall_item_num = 10\n",
+ "\n",
+ "# Popular articles used for padding\n",
+ "item_topk_click = get_item_topk_click(all_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(all_click_df['user_id'].unique()):\n",
+ "    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim,\n",
+ "                                                        sim_item_topk, recall_item_num, item_topk_click)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Convert the recall dictionary to a DataFrame"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:16:36.647466Z",
+ "start_time": "2020-11-16T10:16:24.791219Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:04<00:00, 53319.08it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Convert the dictionary into a DataFrame\n",
+ "user_item_score_list = []\n",
+ "\n",
+ "for user, items in tqdm(user_recall_items_dict.items()):\n",
+ "    for item, score in items:\n",
+ "        user_item_score_list.append([user, item, score])\n",
+ "\n",
+ "recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Generate the submission file"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:16:46.268341Z",
+ "start_time": "2020-11-16T10:16:46.259293Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Generate the submission file\n",
+ "def submit(recall_df, topk=5, model_name=None):\n",
+ "    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
+ "    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+ "    # Check that every user has at least topk articles\n",
+ "    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
+ "    assert tmp.min() >= topk\n",
+ "\n",
+ "    del recall_df['pred_score']\n",
+ "    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
+ "\n",
+ "    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
+ "    # Name the columns according to the submission format\n",
+ "    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2',\n",
+ "                                    3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
+ "\n",
+ "    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
+ "    submit.to_csv(save_name, index=False, header=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:17:42.254328Z",
+ "start_time": "2020-11-16T10:17:32.211862Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Load the test set\n",
+ "tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "tst_users = tst_click['user_id'].unique()\n",
+ "\n",
+ "# Select the test-set users from all recall results\n",
+ "tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]\n",
+ "\n",
+ "# Generate the submission file\n",
+ "submit(tst_recall, topk=5, model_name='itemcf_baseline')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
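+ "# Summary\n",
+ "This section covered the task overview, the data overview, the evaluation metric, and an overall analysis of how to approach the task. As a warm-up before the competition, it is meant to help learners get into the task more easily and to lay a solid foundation for the material that follows. Finally, we gave a simple baseline for the task to show the overall workflow of a news recommendation competition; the following sections describe each step of that workflow in detail.\n",
+ "\n",
+ "Today's material is fairly simple. Below are some lessons on understanding a competition task:\n",
+ "\n",
+ "* What exactly are we understanding when we understand a task? \n",
+ "\n",
+ ">**Understand the task**: sort out the problem intuitively and analyze its goal, that is, what exactly we are asked to do. **This is very important**\n",
+ ">\n",
+ ">**Understand the data**: get a first impression of the data, know which fields are relevant to the task and what types they have, how the data tables relate to each other, and roughly which data will be most useful for solving the problem. This prepares the later data analysis and feature engineering.\n",
+ ">\n",
+ ">**Understand the evaluation metric**: the evaluation metric is the standard by which our method and results are judged. Only with a correct understanding of the metric can we train models and make predictions well. In addition, online evaluation is often limited in time and number of submissions, **so building a sound local validation set and a local version of the metric is a key step that can save a lot of time**. Different metrics differ in how sensitive they are to the errors in the same predictions, so the choice of metric affects what later predictions should focus on.\n",
+ "\n",
+ "* What should we do once we understand the task?\n",
+ "\n",
+ "  >With some understanding of the task, and having worked out the type and nature of the problem and understood the data, we can outline an overall approach and framework for solving it\n",
+ "  >\n",
+ "  >We should have at least some analysis of our own, for example **where the difficulties and the key points of the task might lie, and where better features might be mined**.\n",
+ "  >\n",
+ "  >Which offline validation scheme is more stable, and **if overfitting or other problems appear, roughly which methods could address them**\n",
+ "\n",
+ "  This analysis is done at a high level; it helps clarify the overall line of attack on the task and the direction of later analysis\n",
+ "\n",
+ "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and a group of team members with an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, and the courage to experiment and take responsibility. Datawhale also applies open-source ideas to explore open-source content, open-source learning, and open-source solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; for details, follow Datawhale:\n",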
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
- "# 获取测试集\n",
- "tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "tst_users = tst_click['user_id'].unique()\n",
- "\n",
- "# 从所有的召回数据中将测试集中的用户选出来\n",
- "tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]\n",
- "\n",
- "# 生成提交文件\n",
- "submit(tst_recall, topk=5, model_name='itemcf_baseline')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 总结\n",
- "本节内容主要包括赛题简介,数据概况,评价方式以及对该赛题进行了一个总体上的思路分析,作为竞赛前的预热,旨在帮助学习者们能够更好切入该赛题,为后面的学习内容打下一个良好的基础。最后我们给出了关于本赛题的一个简易Baseline, 帮助学习者们先了解一下新闻推荐比赛的一个整理流程, 接下来我们就对于流程中的每个步骤进行详细的介绍。\n",
- "\n",
- "今天的学习比较简单,下面整理一下关于赛题理解的一些经验:\n",
- "\n",
- "* 赛题理解究竟是在理解什么? \n",
- "\n",
- ">**理解赛题**:从直观上对问题进行梳理, 分析问题的目标,到底要让做什么事情, **这个非常重要**\n",
- ">\n",
- ">**理解数据**:对赛题数据有一个初步了解,知道和任务相关的数据字段和数据字段的类型, 数据之间的内在关联等,大体梳理一下哪些数据会对我们解决问题非常有用,方便后面我们的数据分析和特征工程。\n",
- ">\n",
- ">**理解评估指标**:评估指标是检验我们提出的方法,我们给出结果好坏的标准,只有正确的理解了评估指标,我们才能进行更好的训练模型,更好的进行预测。此外,很多情况下,线上验证是有一定的时间和次数限制的,**所以在比赛中构建一个合理的本地的验证集和验证的评价指标是很关键的步骤,能有效的节省很多时间**。 不同的指标对于同样的预测结果是具有误差敏感的差异性的所以不同的评价指标会影响后续一些预测的侧重点。\n",
- "\n",
- "* 有了赛题理解之后,我们该做什么?\n",
- "\n",
- " >在对于赛题有了一定的了解后,分析清楚了问题的类型性质和对于数据理解 的这一基础上,我们可以梳理一个解决赛题的一个大题思路和框架\n",
- " >\n",
- " >我们至少要有一些相应的理解分析,比如**这题的难点可能在哪里,关键点可能在哪里,哪些地方可以挖掘更好的特征**.\n",
- " >\n",
- " >用什么样得线下验证方式更为稳定,**出现了过拟合或者其他问题,估摸可以用什么方法去解决这些问题**\n",
- "\n",
- " 这时是在一个宏观的大体下分析的,有助于摸清整个题的思路脉络,以及后续的分析方向\n",
- "\n",
- "**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.3"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130006",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "170px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
},
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.3"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130006",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "170px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.ipynb" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
index c9cbc0c37..6bc2d7d2b 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.ipynb"
@@ -1,3980 +1,3980 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 数据分析\n",
- "\n",
- "数据分析的价值主要在于熟悉了解整个数据集的基本情况包括每个文件里有哪些数据,具体的文件中的每个字段表示什么实际含义,以及数据集中特征之间的相关性,在推荐场景下主要就是分析用户本身的基本属性,文章基本属性,以及用户和文章交互的一些分布,这些都有利于后面的召回策略的选择,以及特征工程。\n",
- "\n",
- "**建议:当特征工程和模型调参已经很难继续上分了,可以回来在重新从新的角度去分析这些数据,或许可以找到上分的灵感**\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 导包"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:13:59.322486Z",
- "start_time": "2020-11-13T15:13:55.601445Z"
- }
- },
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "import pandas as pd\n",
- "import numpy as np\n",
- "\n",
- "import matplotlib.pyplot as plt\n",
- "import seaborn as sns\n",
- "plt.rc('font', family='SimHei', size=13)\n",
- "\n",
- "import os,gc,re,warnings,sys\n",
- "warnings.filterwarnings(\"ignore\")"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Loading the Data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:14:18.918041Z",
- "start_time": "2020-11-13T15:14:02.568798Z"
- }
- },
- "outputs": [],
- "source": [
- "# path = './data/' # custom local path\n",
- "path = './' # path on the Tianchi platform\n",
- "\n",
- "#####train\n",
- "trn_click = pd.read_csv(path+'train_click_log.csv')\n",
- "#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])\n",
- "item_df = pd.read_csv(path+'articles.csv')\n",
- "item_df = item_df.rename(columns={'article_id': 'click_article_id'}) # rename so the key matches the click logs in later merges\n",
- "item_emb_df = pd.read_csv(path+'articles_emb.csv')\n",
- "\n",
- "#####test\n",
- "tst_click = pd.read_csv(path+'testA_click_log.csv')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Preprocessing\n",
- "Compute each user's click rank and click count"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:14:31.746748Z",
- "start_time": "2020-11-13T15:14:31.409643Z"
- }
- },
- "outputs": [],
- "source": [
- "# Rank each user's clicks by timestamp (rank 1 = most recent click)\n",
- "trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)\n",
- "tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:15:04.503079Z",
- "start_time": "2020-11-13T15:15:04.394329Z"
- }
- },
- "outputs": [],
- "source": [
- "# Count each user's clicks and store the result in a new column, click_cnts\n",
- "trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')\n",
- "tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data Overview"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### User click log (training set)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:16:07.764776Z",
- "start_time": "2020-11-13T15:16:07.536342Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 199999 \n",
- " 160417 \n",
- " 1507029570190 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 1 \n",
- " 11 \n",
- " 11 \n",
- " 281 \n",
- " 1506942089000 \n",
- " 173 \n",
- " \n",
- " \n",
- " 1 \n",
- " 199999 \n",
- " 5408 \n",
- " 1507029571478 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 1 \n",
- " 10 \n",
- " 11 \n",
- " 4 \n",
- " 1506994257000 \n",
- " 118 \n",
- " \n",
- " \n",
- " 2 \n",
- " 199999 \n",
- " 50823 \n",
- " 1507029601478 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 1 \n",
- " 9 \n",
- " 11 \n",
- " 99 \n",
- " 1507013614000 \n",
- " 213 \n",
- " \n",
- " \n",
- " 3 \n",
- " 199998 \n",
- " 157770 \n",
- " 1507029532200 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 25 \n",
- " 5 \n",
- " 40 \n",
- " 40 \n",
- " 281 \n",
- " 1506983935000 \n",
- " 201 \n",
- " \n",
- " \n",
- " 4 \n",
- " 199998 \n",
- " 96613 \n",
- " 1507029671831 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 25 \n",
- " 5 \n",
- " 39 \n",
- " 40 \n",
- " 209 \n",
- " 1506938444000 \n",
- " 185 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
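+ "# Data Analysis\n",
+ "\n",
+ "The main value of data analysis is getting familiar with the dataset as a whole: what data each file contains, what each field in those files actually means, and how the features correlate with one another. In a recommendation setting this mostly means analyzing the basic attributes of users, the basic attributes of articles, and the distributions of user-article interactions, all of which inform the later choice of recall strategy and the feature engineering.\n",
+ "\n",
+ "**Tip: once feature engineering and hyperparameter tuning no longer improve the score, come back and re-analyze the data from a new angle; it may give you ideas for further gains.**\n"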
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:13:59.322486Z",
+ "start_time": "2020-11-13T15:13:55.601445Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%matplotlib inline\n",
+ "import pandas as pd\n",
+ "import numpy as np\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "plt.rc('font', family='SimHei', size=13)\n",
+ "\n",
+ "import os,gc,re,warnings,sys\n",
+ "warnings.filterwarnings(\"ignore\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading the Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:14:18.918041Z",
+ "start_time": "2020-11-13T15:14:02.568798Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# path = './data/' # custom local path\n",
+ "path = './' # path on the Tianchi platform\n",
+ "\n",
+ "#####train\n",
+ "trn_click = pd.read_csv(path+'train_click_log.csv')\n",
+ "#trn_click = pd.read_csv(path+'train_click_log.csv', names=['user_id','item_id','click_time','click_environment','click_deviceGroup','click_os','click_country','click_region','click_referrer_type'])\n",
+ "item_df = pd.read_csv(path+'articles.csv')\n",
+ "item_df = item_df.rename(columns={'article_id': 'click_article_id'}) # rename so the key matches the click logs in later merges\n",
+ "item_emb_df = pd.read_csv(path+'articles_emb.csv')\n",
+ "\n",
+ "#####test\n",
+ "tst_click = pd.read_csv(path+'testA_click_log.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Preprocessing\n",
+ "Compute each user's click rank and click count"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:14:31.746748Z",
+ "start_time": "2020-11-13T15:14:31.409643Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Rank each user's clicks by timestamp (rank 1 = most recent click)\n",
+ "trn_click['rank'] = trn_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)\n",
+ "tst_click['rank'] = tst_click.groupby(['user_id'])['click_timestamp'].rank(ascending=False).astype(int)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:15:04.503079Z",
+ "start_time": "2020-11-13T15:15:04.394329Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Count each user's clicks and store the result in a new column, click_cnts\n",
+ "trn_click['click_cnts'] = trn_click.groupby(['user_id'])['click_timestamp'].transform('count')\n",
+ "tst_click['click_cnts'] = tst_click.groupby(['user_id'])['click_timestamp'].transform('count')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data Overview"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### User click log (training set)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:16:07.764776Z",
+ "start_time": "2020-11-13T15:16:07.536342Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " rank \n",
+ " click_cnts \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 199999 \n",
+ " 160417 \n",
+ " 1507029570190 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 13 \n",
+ " 1 \n",
+ " 11 \n",
+ " 11 \n",
+ " 281 \n",
+ " 1506942089000 \n",
+ " 173 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 199999 \n",
+ " 5408 \n",
+ " 1507029571478 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 13 \n",
+ " 1 \n",
+ " 10 \n",
+ " 11 \n",
+ " 4 \n",
+ " 1506994257000 \n",
+ " 118 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 199999 \n",
+ " 50823 \n",
+ " 1507029601478 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 13 \n",
+ " 1 \n",
+ " 9 \n",
+ " 11 \n",
+ " 99 \n",
+ " 1507013614000 \n",
+ " 213 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 199998 \n",
+ " 157770 \n",
+ " 1507029532200 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 25 \n",
+ " 5 \n",
+ " 40 \n",
+ " 40 \n",
+ " 281 \n",
+ " 1506983935000 \n",
+ " 201 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 199998 \n",
+ " 96613 \n",
+ " 1507029671831 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 25 \n",
+ " 5 \n",
+ " 39 \n",
+ " 40 \n",
+ " 209 \n",
+ " 1506938444000 \n",
+ " 185 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "0 199999 160417 1507029570190 4 \n",
+ "1 199999 5408 1507029571478 4 \n",
+ "2 199999 50823 1507029601478 4 \n",
+ "3 199998 157770 1507029532200 4 \n",
+ "4 199998 96613 1507029671831 4 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "0 1 17 1 13 \n",
+ "1 1 17 1 13 \n",
+ "2 1 17 1 13 \n",
+ "3 1 17 1 25 \n",
+ "4 1 17 1 25 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
+ "0 1 11 11 281 1506942089000 \n",
+ "1 1 10 11 4 1506994257000 \n",
+ "2 1 9 11 99 1507013614000 \n",
+ "3 5 40 40 281 1506983935000 \n",
+ "4 5 39 40 209 1506938444000 \n",
+ "\n",
+ " words_count \n",
+ "0 173 \n",
+ "1 118 \n",
+ "2 213 \n",
+ "3 201 \n",
+ "4 185 "
+ ]
+ },
+ "execution_count": 7,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "0 199999 160417 1507029570190 4 \n",
- "1 199999 5408 1507029571478 4 \n",
- "2 199999 50823 1507029601478 4 \n",
- "3 199998 157770 1507029532200 4 \n",
- "4 199998 96613 1507029671831 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "0 1 17 1 13 \n",
- "1 1 17 1 13 \n",
- "2 1 17 1 13 \n",
- "3 1 17 1 25 \n",
- "4 1 17 1 25 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "0 1 11 11 281 1506942089000 \n",
- "1 1 10 11 4 1506994257000 \n",
- "2 1 9 11 99 1507013614000 \n",
- "3 5 40 40 281 1506983935000 \n",
- "4 5 39 40 209 1506938444000 \n",
- "\n",
- " words_count \n",
- "0 173 \n",
- "1 118 \n",
- "2 213 \n",
- "3 201 \n",
- "4 185 "
- ]
- },
- "execution_count": 7,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])\n",
- "trn_click.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
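- "#### Meaning of each field in train_click_log.csv\n",
- "\n",
- "1. user_id: unique identifier of the user\n",
- "2. click_article_id: unique identifier of the article the user clicked\n",
- "3. click_timestamp: timestamp at which the user clicked the article\n",
- "4. click_environment: environment in which the user clicked the article\n",
- "5. click_deviceGroup: device group the user clicked from\n",
- "6. click_os: operating system the user clicked from\n",
- "7. click_country: country the user was in when clicking\n",
- "8. click_region: region the user was in when clicking\n",
- "9. click_referrer_type: source through which the user reached the article when clicking"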
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:16:18.536902Z",
- "start_time": "2020-11-13T15:16:18.424203Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "\n",
- "Int64Index: 1112623 entries, 0 to 1112622\n",
- "Data columns (total 14 columns):\n",
- "user_id 1112623 non-null int64\n",
- "click_article_id 1112623 non-null int64\n",
- "click_timestamp 1112623 non-null int64\n",
- "click_environment 1112623 non-null int64\n",
- "click_deviceGroup 1112623 non-null int64\n",
- "click_os 1112623 non-null int64\n",
- "click_country 1112623 non-null int64\n",
- "click_region 1112623 non-null int64\n",
- "click_referrer_type 1112623 non-null int64\n",
- "rank 1112623 non-null int64\n",
- "click_cnts 1112623 non-null int64\n",
- "category_id 1112623 non-null int64\n",
- "created_at_ts 1112623 non-null int64\n",
- "words_count 1112623 non-null int64\n",
- "dtypes: int64(14)\n",
- "memory usage: 127.3 MB\n"
- ]
- }
- ],
- "source": [
- "# Summary info for the user click log\n",
- "trn_click.info()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " 1.112623e+06 \n",
- " \n",
- " \n",
- " mean \n",
- " 1.221198e+05 \n",
- " 1.951541e+05 \n",
- " 1.507588e+12 \n",
- " 3.947786e+00 \n",
- " 1.815981e+00 \n",
- " 1.301976e+01 \n",
- " 1.310776e+00 \n",
- " 1.813587e+01 \n",
- " 1.910063e+00 \n",
- " 7.118518e+00 \n",
- " 1.323704e+01 \n",
- " 3.056176e+02 \n",
- " 1.506598e+12 \n",
- " 2.011981e+02 \n",
- " \n",
- " \n",
- " std \n",
- " 5.540349e+04 \n",
- " 9.292286e+04 \n",
- " 3.363466e+08 \n",
- " 3.276715e-01 \n",
- " 1.035170e+00 \n",
- " 6.967844e+00 \n",
- " 1.618264e+00 \n",
- " 7.105832e+00 \n",
- " 1.220012e+00 \n",
- " 1.016095e+01 \n",
- " 1.631503e+01 \n",
- " 1.155791e+02 \n",
- " 8.343066e+09 \n",
- " 5.223881e+01 \n",
- " \n",
- " \n",
- " min \n",
- " 0.000000e+00 \n",
- " 3.000000e+00 \n",
- " 1.507030e+12 \n",
- " 1.000000e+00 \n",
- " 1.000000e+00 \n",
- " 2.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.000000e+00 \n",
- " 2.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.166573e+12 \n",
- " 0.000000e+00 \n",
- " \n",
- " \n",
- " 25% \n",
- " 7.934700e+04 \n",
- " 1.239090e+05 \n",
- " 1.507297e+12 \n",
- " 4.000000e+00 \n",
- " 1.000000e+00 \n",
- " 2.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.300000e+01 \n",
- " 1.000000e+00 \n",
- " 2.000000e+00 \n",
- " 4.000000e+00 \n",
- " 2.500000e+02 \n",
- " 1.507220e+12 \n",
- " 1.700000e+02 \n",
- " \n",
- " \n",
- " 50% \n",
- " 1.309670e+05 \n",
- " 2.038900e+05 \n",
- " 1.507596e+12 \n",
- " 4.000000e+00 \n",
- " 1.000000e+00 \n",
- " 1.700000e+01 \n",
- " 1.000000e+00 \n",
- " 2.100000e+01 \n",
- " 2.000000e+00 \n",
- " 4.000000e+00 \n",
- " 8.000000e+00 \n",
- " 3.280000e+02 \n",
- " 1.507553e+12 \n",
- " 1.970000e+02 \n",
- " \n",
- " \n",
- " 75% \n",
- " 1.704010e+05 \n",
- " 2.777120e+05 \n",
- " 1.507841e+12 \n",
- " 4.000000e+00 \n",
- " 3.000000e+00 \n",
- " 1.700000e+01 \n",
- " 1.000000e+00 \n",
- " 2.500000e+01 \n",
- " 2.000000e+00 \n",
- " 8.000000e+00 \n",
- " 1.600000e+01 \n",
- " 4.100000e+02 \n",
- " 1.507756e+12 \n",
- " 2.280000e+02 \n",
- " \n",
- " \n",
- " max \n",
- " 1.999990e+05 \n",
- " 3.640460e+05 \n",
- " 1.510603e+12 \n",
- " 4.000000e+00 \n",
- " 5.000000e+00 \n",
- " 2.000000e+01 \n",
- " 1.100000e+01 \n",
- " 2.800000e+01 \n",
- " 7.000000e+00 \n",
- " 2.410000e+02 \n",
- " 2.410000e+02 \n",
- " 4.600000e+02 \n",
- " 1.510666e+12 \n",
- " 6.690000e+03 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])\n",
+ "trn_click.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
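+ "#### Meaning of each field in train_click_log.csv\n",
+ "\n",
+ "1. user_id: unique identifier of the user\n",
+ "2. click_article_id: unique identifier of the article the user clicked\n",
+ "3. click_timestamp: timestamp at which the user clicked the article\n",
+ "4. click_environment: environment in which the user clicked the article\n",
+ "5. click_deviceGroup: device group the user clicked from\n",
+ "6. click_os: operating system the user clicked from\n",
+ "7. click_country: country the user was in when clicking\n",
+ "8. click_region: region the user was in when clicking\n",
+ "9. click_referrer_type: source through which the user reached the article when clicking"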
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:16:18.536902Z",
+ "start_time": "2020-11-13T15:16:18.424203Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "Int64Index: 1112623 entries, 0 to 1112622\n",
+ "Data columns (total 14 columns):\n",
+ "user_id 1112623 non-null int64\n",
+ "click_article_id 1112623 non-null int64\n",
+ "click_timestamp 1112623 non-null int64\n",
+ "click_environment 1112623 non-null int64\n",
+ "click_deviceGroup 1112623 non-null int64\n",
+ "click_os 1112623 non-null int64\n",
+ "click_country 1112623 non-null int64\n",
+ "click_region 1112623 non-null int64\n",
+ "click_referrer_type 1112623 non-null int64\n",
+ "rank 1112623 non-null int64\n",
+ "click_cnts 1112623 non-null int64\n",
+ "category_id 1112623 non-null int64\n",
+ "created_at_ts 1112623 non-null int64\n",
+ "words_count 1112623 non-null int64\n",
+ "dtypes: int64(14)\n",
+ "memory usage: 127.3 MB\n"
+ ]
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.221198e+05 1.951541e+05 1.507588e+12 3.947786e+00 \n",
- "std 5.540349e+04 9.292286e+04 3.363466e+08 3.276715e-01 \n",
- "min 0.000000e+00 3.000000e+00 1.507030e+12 1.000000e+00 \n",
- "25% 7.934700e+04 1.239090e+05 1.507297e+12 4.000000e+00 \n",
- "50% 1.309670e+05 2.038900e+05 1.507596e+12 4.000000e+00 \n",
- "75% 1.704010e+05 2.777120e+05 1.507841e+12 4.000000e+00 \n",
- "max 1.999990e+05 3.640460e+05 1.510603e+12 4.000000e+00 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.815981e+00 1.301976e+01 1.310776e+00 1.813587e+01 \n",
- "std 1.035170e+00 6.967844e+00 1.618264e+00 7.105832e+00 \n",
- "min 1.000000e+00 2.000000e+00 1.000000e+00 1.000000e+00 \n",
- "25% 1.000000e+00 2.000000e+00 1.000000e+00 1.300000e+01 \n",
- "50% 1.000000e+00 1.700000e+01 1.000000e+00 2.100000e+01 \n",
- "75% 3.000000e+00 1.700000e+01 1.000000e+00 2.500000e+01 \n",
- "max 5.000000e+00 2.000000e+01 1.100000e+01 2.800000e+01 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id \\\n",
- "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
- "mean 1.910063e+00 7.118518e+00 1.323704e+01 3.056176e+02 \n",
- "std 1.220012e+00 1.016095e+01 1.631503e+01 1.155791e+02 \n",
- "min 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+00 \n",
- "25% 1.000000e+00 2.000000e+00 4.000000e+00 2.500000e+02 \n",
- "50% 2.000000e+00 4.000000e+00 8.000000e+00 3.280000e+02 \n",
- "75% 2.000000e+00 8.000000e+00 1.600000e+01 4.100000e+02 \n",
- "max 7.000000e+00 2.410000e+02 2.410000e+02 4.600000e+02 \n",
- "\n",
- " created_at_ts words_count \n",
- "count 1.112623e+06 1.112623e+06 \n",
- "mean 1.506598e+12 2.011981e+02 \n",
- "std 8.343066e+09 5.223881e+01 \n",
- "min 1.166573e+12 0.000000e+00 \n",
- "25% 1.507220e+12 1.700000e+02 \n",
- "50% 1.507553e+12 1.970000e+02 \n",
- "75% 1.507756e+12 2.280000e+02 \n",
- "max 1.510666e+12 6.690000e+03 "
- ]
- },
- "execution_count": 9,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click.describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "200000"
- ]
- },
- "execution_count": 10,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# The training set contains 200,000 users\n",
- "trn_click.user_id.nunique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T16:03:01.378461Z",
- "start_time": "2020-11-13T16:03:01.300712Z"
- }
- },
- "outputs": [
+ "source": [
+ "# Summary info for the user click log\n",
+ "trn_click.info()"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "2"
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " rank \n",
+ " click_cnts \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " 1.112623e+06 \n",
+ " \n",
+ " \n",
+ " mean \n",
+ " 1.221198e+05 \n",
+ " 1.951541e+05 \n",
+ " 1.507588e+12 \n",
+ " 3.947786e+00 \n",
+ " 1.815981e+00 \n",
+ " 1.301976e+01 \n",
+ " 1.310776e+00 \n",
+ " 1.813587e+01 \n",
+ " 1.910063e+00 \n",
+ " 7.118518e+00 \n",
+ " 1.323704e+01 \n",
+ " 3.056176e+02 \n",
+ " 1.506598e+12 \n",
+ " 2.011981e+02 \n",
+ " \n",
+ " \n",
+ " std \n",
+ " 5.540349e+04 \n",
+ " 9.292286e+04 \n",
+ " 3.363466e+08 \n",
+ " 3.276715e-01 \n",
+ " 1.035170e+00 \n",
+ " 6.967844e+00 \n",
+ " 1.618264e+00 \n",
+ " 7.105832e+00 \n",
+ " 1.220012e+00 \n",
+ " 1.016095e+01 \n",
+ " 1.631503e+01 \n",
+ " 1.155791e+02 \n",
+ " 8.343066e+09 \n",
+ " 5.223881e+01 \n",
+ " \n",
+ " \n",
+ " min \n",
+ " 0.000000e+00 \n",
+ " 3.000000e+00 \n",
+ " 1.507030e+12 \n",
+ " 1.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 2.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 2.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.166573e+12 \n",
+ " 0.000000e+00 \n",
+ " \n",
+ " \n",
+ " 25% \n",
+ " 7.934700e+04 \n",
+ " 1.239090e+05 \n",
+ " 1.507297e+12 \n",
+ " 4.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 2.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.300000e+01 \n",
+ " 1.000000e+00 \n",
+ " 2.000000e+00 \n",
+ " 4.000000e+00 \n",
+ " 2.500000e+02 \n",
+ " 1.507220e+12 \n",
+ " 1.700000e+02 \n",
+ " \n",
+ " \n",
+ " 50% \n",
+ " 1.309670e+05 \n",
+ " 2.038900e+05 \n",
+ " 1.507596e+12 \n",
+ " 4.000000e+00 \n",
+ " 1.000000e+00 \n",
+ " 1.700000e+01 \n",
+ " 1.000000e+00 \n",
+ " 2.100000e+01 \n",
+ " 2.000000e+00 \n",
+ " 4.000000e+00 \n",
+ " 8.000000e+00 \n",
+ " 3.280000e+02 \n",
+ " 1.507553e+12 \n",
+ " 1.970000e+02 \n",
+ " \n",
+ " \n",
+ " 75% \n",
+ " 1.704010e+05 \n",
+ " 2.777120e+05 \n",
+ " 1.507841e+12 \n",
+ " 4.000000e+00 \n",
+ " 3.000000e+00 \n",
+ " 1.700000e+01 \n",
+ " 1.000000e+00 \n",
+ " 2.500000e+01 \n",
+ " 2.000000e+00 \n",
+ " 8.000000e+00 \n",
+ " 1.600000e+01 \n",
+ " 4.100000e+02 \n",
+ " 1.507756e+12 \n",
+ " 2.280000e+02 \n",
+ " \n",
+ " \n",
+ " max \n",
+ " 1.999990e+05 \n",
+ " 3.640460e+05 \n",
+ " 1.510603e+12 \n",
+ " 4.000000e+00 \n",
+ " 5.000000e+00 \n",
+ " 2.000000e+01 \n",
+ " 1.100000e+01 \n",
+ " 2.800000e+01 \n",
+ " 7.000000e+00 \n",
+ " 2.410000e+02 \n",
+ " 2.410000e+02 \n",
+ " 4.600000e+02 \n",
+ " 1.510666e+12 \n",
+ " 6.690000e+03 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.221198e+05 1.951541e+05 1.507588e+12 3.947786e+00 \n",
+ "std 5.540349e+04 9.292286e+04 3.363466e+08 3.276715e-01 \n",
+ "min 0.000000e+00 3.000000e+00 1.507030e+12 1.000000e+00 \n",
+ "25% 7.934700e+04 1.239090e+05 1.507297e+12 4.000000e+00 \n",
+ "50% 1.309670e+05 2.038900e+05 1.507596e+12 4.000000e+00 \n",
+ "75% 1.704010e+05 2.777120e+05 1.507841e+12 4.000000e+00 \n",
+ "max 1.999990e+05 3.640460e+05 1.510603e+12 4.000000e+00 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.815981e+00 1.301976e+01 1.310776e+00 1.813587e+01 \n",
+ "std 1.035170e+00 6.967844e+00 1.618264e+00 7.105832e+00 \n",
+ "min 1.000000e+00 2.000000e+00 1.000000e+00 1.000000e+00 \n",
+ "25% 1.000000e+00 2.000000e+00 1.000000e+00 1.300000e+01 \n",
+ "50% 1.000000e+00 1.700000e+01 1.000000e+00 2.100000e+01 \n",
+ "75% 3.000000e+00 1.700000e+01 1.000000e+00 2.500000e+01 \n",
+ "max 5.000000e+00 2.000000e+01 1.100000e+01 2.800000e+01 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id \\\n",
+ "count 1.112623e+06 1.112623e+06 1.112623e+06 1.112623e+06 \n",
+ "mean 1.910063e+00 7.118518e+00 1.323704e+01 3.056176e+02 \n",
+ "std 1.220012e+00 1.016095e+01 1.631503e+01 1.155791e+02 \n",
+ "min 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+00 \n",
+ "25% 1.000000e+00 2.000000e+00 4.000000e+00 2.500000e+02 \n",
+ "50% 2.000000e+00 4.000000e+00 8.000000e+00 3.280000e+02 \n",
+ "75% 2.000000e+00 8.000000e+00 1.600000e+01 4.100000e+02 \n",
+ "max 7.000000e+00 2.410000e+02 2.410000e+02 4.600000e+02 \n",
+ "\n",
+ " created_at_ts words_count \n",
+ "count 1.112623e+06 1.112623e+06 \n",
+ "mean 1.506598e+12 2.011981e+02 \n",
+ "std 8.343066e+09 5.223881e+01 \n",
+ "min 1.166573e+12 0.000000e+00 \n",
+ "25% 1.507220e+12 1.700000e+02 \n",
+ "50% 1.507553e+12 1.970000e+02 \n",
+ "75% 1.507756e+12 2.280000e+02 \n",
+ "max 1.510666e+12 6.690000e+03 "
+ ]
+ },
+ "execution_count": 9,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_click.describe()"
]
- },
- "execution_count": 11,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click.groupby('user_id')['click_article_id'].count().min() # every user in the training set has at least two clicks"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "##### Plot histograms for a rough look at the basic attribute distributions"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n",
- "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n"
- ]
- },
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "metadata": {},
- "output_type": "display_data"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.figure()\n",
- "plt.figure(figsize=(15, 20))\n",
- "i = 1\n",
- "for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', \n",
- " 'click_region', 'click_referrer_type', 'rank', 'click_cnts']:\n",
- " plot_envs = plt.subplot(5, 2, i)\n",
- " i += 1\n",
- " v = trn_click[col].value_counts().reset_index()[:10]\n",
- " fig = sns.barplot(x=v['index'], y=v[col])\n",
- " for item in fig.get_xticklabels():\n",
- " item.set_rotation(90)\n",
- " plt.title(col)\n",
- "plt.tight_layout()\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
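- "Note: click_cnts is a per-user total attached to every click record, so the click_cnts chart here counts click records, not users, for each click count.\n",
- "\n",
- "The data can also be analyzed from the user's perspective, by plotting a histogram of how many articles each user clicked."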
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "4 1084627\n",
- "2 25894\n",
- "1 2102\n",
- "Name: click_environment, dtype: int64"
- ]
- },
- "execution_count": 13,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click['click_environment'].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
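- "For the click environment click_environment, only 2102 clicks (0.19%) have environment 1 and only 25894 clicks (2.3%) have environment 2; the remaining 1084627 clicks (97.5%) have environment 4."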
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "1 678187\n",
- "3 395558\n",
- "4 38731\n",
- "5 141\n",
- "2 6\n",
- "Name: click_deviceGroup, dtype: int64"
- ]
- },
- "execution_count": 15,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_click['click_deviceGroup'].value_counts()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "For the click device group click_deviceGroup, device group 1 accounts for the majority (61%) and device group 3 for 36%."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### User click log (test set)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 249999 \n",
- " 160974 \n",
- " 1506959142820 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 19 \n",
- " 19 \n",
- " 281 \n",
- " 1506912747000 \n",
- " 259 \n",
- " \n",
- " \n",
- " 1 \n",
- " 249999 \n",
- " 160417 \n",
- " 1506959172820 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 18 \n",
- " 19 \n",
- " 281 \n",
- " 1506942089000 \n",
- " 173 \n",
- " \n",
- " \n",
- " 2 \n",
- " 249998 \n",
- " 160974 \n",
- " 1506959056066 \n",
- " 4 \n",
- " 1 \n",
- " 12 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 5 \n",
- " 5 \n",
- " 281 \n",
- " 1506912747000 \n",
- " 259 \n",
- " \n",
- " \n",
- " 3 \n",
- " 249998 \n",
- " 202557 \n",
- " 1506959086066 \n",
- " 4 \n",
- " 1 \n",
- " 12 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 4 \n",
- " 5 \n",
- " 327 \n",
- " 1506938401000 \n",
- " 219 \n",
- " \n",
- " \n",
- " 4 \n",
- " 249997 \n",
- " 183665 \n",
- " 1506959088613 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 15 \n",
- " 5 \n",
- " 7 \n",
- " 7 \n",
- " 301 \n",
- " 1500895686000 \n",
- " 256 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "200000"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "0 249999 160974 1506959142820 4 \n",
- "1 249999 160417 1506959172820 4 \n",
- "2 249998 160974 1506959056066 4 \n",
- "3 249998 202557 1506959086066 4 \n",
- "4 249997 183665 1506959088613 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "0 1 17 1 13 \n",
- "1 1 17 1 13 \n",
- "2 1 12 1 13 \n",
- "3 1 12 1 13 \n",
- "4 1 17 1 15 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "0 2 19 19 281 1506912747000 \n",
- "1 2 18 19 281 1506942089000 \n",
- "2 2 5 5 281 1506912747000 \n",
- "3 2 4 5 327 1506938401000 \n",
- "4 5 7 7 301 1500895686000 \n",
- "\n",
- " words_count \n",
- "0 259 \n",
- "1 173 \n",
- "2 259 \n",
- "3 219 \n",
- "4 256 "
- ]
- },
- "execution_count": 16,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])\n",
- "tst_click.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 5.180100e+05 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 518010.000000 \n",
- " 5.180100e+05 \n",
- " 518010.000000 \n",
- " \n",
- " \n",
- " mean \n",
- " 227342.428169 \n",
- " 193803.792550 \n",
- " 1.507387e+12 \n",
- " 3.947300 \n",
- " 1.738285 \n",
- " 13.628467 \n",
- " 1.348209 \n",
- " 18.250250 \n",
- " 1.819614 \n",
- " 15.521785 \n",
- " 30.043586 \n",
- " 305.324961 \n",
- " 1.506883e+12 \n",
- " 210.966331 \n",
- " \n",
- " \n",
- " std \n",
- " 14613.907188 \n",
- " 88279.388177 \n",
- " 3.706127e+08 \n",
- " 0.323916 \n",
- " 1.020858 \n",
- " 6.625564 \n",
- " 1.703524 \n",
- " 7.060798 \n",
- " 1.082657 \n",
- " 33.957702 \n",
- " 56.868021 \n",
- " 110.411513 \n",
- " 5.816668e+09 \n",
- " 83.040065 \n",
- " \n",
- " \n",
- " min \n",
- " 200000.000000 \n",
- " 137.000000 \n",
- " 1.506959e+12 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 2.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.000000 \n",
- " 1.265812e+12 \n",
- " 0.000000 \n",
- " \n",
- " \n",
- " 25% \n",
- " 214926.000000 \n",
- " 128551.000000 \n",
- " 1.507026e+12 \n",
- " 4.000000 \n",
- " 1.000000 \n",
- " 12.000000 \n",
- " 1.000000 \n",
- " 13.000000 \n",
- " 1.000000 \n",
- " 4.000000 \n",
- " 10.000000 \n",
- " 252.000000 \n",
- " 1.506970e+12 \n",
- " 176.000000 \n",
- " \n",
- " \n",
- " 50% \n",
- " 229109.000000 \n",
- " 199197.000000 \n",
- " 1.507308e+12 \n",
- " 4.000000 \n",
- " 1.000000 \n",
- " 17.000000 \n",
- " 1.000000 \n",
- " 21.000000 \n",
- " 2.000000 \n",
- " 8.000000 \n",
- " 19.000000 \n",
- " 323.000000 \n",
- " 1.507249e+12 \n",
- " 199.000000 \n",
- " \n",
- " \n",
- " 75% \n",
- " 240182.000000 \n",
- " 272143.000000 \n",
- " 1.507666e+12 \n",
- " 4.000000 \n",
- " 3.000000 \n",
- " 17.000000 \n",
- " 1.000000 \n",
- " 25.000000 \n",
- " 2.000000 \n",
- " 18.000000 \n",
- " 35.000000 \n",
- " 399.000000 \n",
- " 1.507630e+12 \n",
- " 232.000000 \n",
- " \n",
- " \n",
- " max \n",
- " 249999.000000 \n",
- " 364043.000000 \n",
- " 1.508832e+12 \n",
- " 4.000000 \n",
- " 5.000000 \n",
- " 20.000000 \n",
- " 11.000000 \n",
- " 28.000000 \n",
- " 7.000000 \n",
- " 938.000000 \n",
- " 938.000000 \n",
- " 460.000000 \n",
- " 1.509949e+12 \n",
- " 3082.000000 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "# The training set contains 200,000 users\n",
+ "trn_click.user_id.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T16:03:01.378461Z",
+ "start_time": "2020-11-13T16:03:01.300712Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "2"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "count 518010.000000 518010.000000 5.180100e+05 518010.000000 \n",
- "mean 227342.428169 193803.792550 1.507387e+12 3.947300 \n",
- "std 14613.907188 88279.388177 3.706127e+08 0.323916 \n",
- "min 200000.000000 137.000000 1.506959e+12 1.000000 \n",
- "25% 214926.000000 128551.000000 1.507026e+12 4.000000 \n",
- "50% 229109.000000 199197.000000 1.507308e+12 4.000000 \n",
- "75% 240182.000000 272143.000000 1.507666e+12 4.000000 \n",
- "max 249999.000000 364043.000000 1.508832e+12 4.000000 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
- "mean 1.738285 13.628467 1.348209 18.250250 \n",
- "std 1.020858 6.625564 1.703524 7.060798 \n",
- "min 1.000000 2.000000 1.000000 1.000000 \n",
- "25% 1.000000 12.000000 1.000000 13.000000 \n",
- "50% 1.000000 17.000000 1.000000 21.000000 \n",
- "75% 3.000000 17.000000 1.000000 25.000000 \n",
- "max 5.000000 20.000000 11.000000 28.000000 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id \\\n",
- "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
- "mean 1.819614 15.521785 30.043586 305.324961 \n",
- "std 1.082657 33.957702 56.868021 110.411513 \n",
- "min 1.000000 1.000000 1.000000 1.000000 \n",
- "25% 1.000000 4.000000 10.000000 252.000000 \n",
- "50% 2.000000 8.000000 19.000000 323.000000 \n",
- "75% 2.000000 18.000000 35.000000 399.000000 \n",
- "max 7.000000 938.000000 938.000000 460.000000 \n",
- "\n",
- " created_at_ts words_count \n",
- "count 5.180100e+05 518010.000000 \n",
- "mean 1.506883e+12 210.966331 \n",
- "std 5.816668e+09 83.040065 \n",
- "min 1.265812e+12 0.000000 \n",
- "25% 1.506970e+12 176.000000 \n",
- "50% 1.507249e+12 199.000000 \n",
- "75% 1.507630e+12 232.000000 \n",
- "max 1.509949e+12 3082.000000 "
- ]
- },
- "execution_count": 17,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click.describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "我们可以看出训练集和测试集的用户是完全不一样的\n",
- "\n",
- "训练集的用户ID由0 ~ 199999,而测试集A的用户ID由200000 ~ 249999。\n",
- "\n",
- "因此,也就是我们在训练时,需要把测试集的数据也包括在内,称为全量数据。\n",
- "\n",
- "!!!!!!!!!!!!!!!后续将对训练集和测试集合并分析!!!!!!!!!!!"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "50000"
- ]
- },
- "execution_count": 18,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "#测试集中的用户数量为5w\n",
- "tst_click.user_id.nunique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:56:07.717463Z",
- "start_time": "2020-11-13T15:56:07.693494Z"
- }
- },
- "outputs": [
+ "source": [
+ "trn_click.groupby('user_id')['click_article_id'].count().min() # every user in the training set clicked at least two articles"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "1"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "##### Plot histograms for a rough look at the basic attribute distributions"
]
- },
- "execution_count": 19,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tst_click.groupby('user_id')['click_article_id'].count().min() # 注意测试集里面有只点击过一次文章的用户"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 新闻文章信息数据表"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:20:34.183761Z",
- "start_time": "2020-11-13T15:20:34.164770Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " click_article_id \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " 0 \n",
- " 1513144419000 \n",
- " 168 \n",
- " \n",
- " \n",
- " 1 \n",
- " 1 \n",
- " 1 \n",
- " 1405341936000 \n",
- " 189 \n",
- " \n",
- " \n",
- " 2 \n",
- " 2 \n",
- " 1 \n",
- " 1408667706000 \n",
- " 250 \n",
- " \n",
- " \n",
- " 3 \n",
- " 3 \n",
- " 1 \n",
- " 1408468313000 \n",
- " 230 \n",
- " \n",
- " \n",
- " 4 \n",
- " 4 \n",
- " 1 \n",
- " 1407071171000 \n",
- " 162 \n",
- " \n",
- " \n",
- " 364042 \n",
- " 364042 \n",
- " 460 \n",
- " 1434034118000 \n",
- " 144 \n",
- " \n",
- " \n",
- " 364043 \n",
- " 364043 \n",
- " 460 \n",
- " 1434148472000 \n",
- " 463 \n",
- " \n",
- " \n",
- " 364044 \n",
- " 364044 \n",
- " 460 \n",
- " 1457974279000 \n",
- " 177 \n",
- " \n",
- " \n",
- " 364045 \n",
- " 364045 \n",
- " 460 \n",
- " 1515964737000 \n",
- " 126 \n",
- " \n",
- " \n",
- " 364046 \n",
- " 364046 \n",
- " 460 \n",
- " 1505811330000 \n",
- " 479 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n",
+ "findfont: Font family ['SimHei'] not found. Falling back to DejaVu Sans.\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAABDAAAAWYCAYAAABArDYhAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAEAAElEQVR4nOzdd5gkZd318e8hIzmJSHBREEUUkRVQfBRBSQoYEMGEiGIC9TGC+Aqi+GAWFVAUBBSJoqxKEAVEVMKCZEQRCYuEJSdBwnn/uO9he4dJuzPbVbV7PtfV11ZXVVef6e2Zrv7VHWSbiIiIiIiIiIg2m6/pABERERERERERo0kBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiYEJLeLencnvsPSHr2KI+ZJMmSFhjnc79d0m/Hc4yIaLcUMCLiKZo8+Zgdkq6UtMkY9rOkNcbxPP8j6ZoRth8h6Uuze/yIiIi5je3FbV830ccd6rzD9tG2N5/o5xolx76SftrP54yYl/X9i0ZEdI/txZvOMEDSEcA0258bWGf7Bf14btt/BNbqx3NFRERERMTM0gIjIjpD0vxNZ4iIiIhC0qqSTpI0XdKdkr43xD5Ptn6UtKikb0i6QdK9ks6VtOgQj3mzpOslrTPC059T/72nthR92RAtSC3pQ5L+Iel+SV+U9BxJf5Z0n6TjJS3Us//rJV0i6Z66z4t6tn1G0s31ONdI2kzSlsBngbfWDJfWfXeRdHXd9zpJ7+85ziaSpkn6tKTbJd0i6Q2Stpb0d0l3Sfpsz/77SjpR0nH1eBdLWncs/z8Rc6MUMCLmcQ2ffCDpBEm31mOdI+kFPduOkHSIpFMkPQjsCrwd+HQ9UfhV3e96Sa+py/NL+qykf9YP+oskrTrE8y4s6euSbpR0m6TvD/VzDHrMJpKm9dxfr55I3C/pOGCRkR4fERExt6gXFX4N3ABMAlYGjh3lYV8H1gdeDiwLfBp4YtBxdwG+ArzG9hUjHOuV9d+lazeVvwyz3xb1OTeqz3co8A5gVWAdYKf6vOsBhwPvB5YDfgBMqecLawG7Ay+1vUQ95vW2TwO+DBxXMwwUFm4HXg8sCewCfEvSS3oyPYNyzrAy8HnghzXT+sD/AP9P0uo9+28HnFBfs58Bv5S04AivTcRcKwWMiHlYC04+AE4F1gSeDlwMHD1o+9uA/YElgKPq9q/WE4VthjjexyknI1tTThzeAzw0xH4HAM8FXgyswYyTiDGpV2x+CfyE8jqcALx5rI+PiIjouA2AZwKfsv2g7YdtnzvczpLmo3wmf9T2zbYft/1n24/07PYx4FPAJravnaCcX7V9n+0rgSuA39q+zva9lHOQ9ep+uwE/sH1+zXYk8Ail8PE4sDCwtqQFbV9v+5/DPaHt39j+p4s/AL+lFCYGPArsb/tRynnX8sCBtu+vOa8CeltZXGT7xLr/NynFj43G+8JEdFEKGBHztsZPPmwfXj+wHwH2BdaVtFTPLifb/pPtJ2w/PIaf6b3A52xfU08cLrV956CfQ5QTlf+1fZft+ylXUHYcw/EHbAQsCHzb9qO2TwQunIXHR0REdNmqwA22Hxvj/stTvngP+8Wfcv5wkO1pI+wzq27rWf7PEPcHxvl6FvCJ2n3kHkn3UH7GZ9bzmY9RzlNul3SspGcO94SStpJ0Xu0Ocg/losryPbvcafvxngxD5ewdf+ymgQXbTwDTKOdvEfOcFDAi5m2NnnzU7h4H1O4e9wHX9zzPgJue+sgRrTpKPoAVgKcBF/WcpJxW14/VM4Gbbbtn3Q2zEjQiIqLDbgJW09hnH7sDeBh4zgj7bA58TtJYWjR69F1myU2UVhFL99yeZvsYANs/s/0KSqHDlJamT8khaWHg55QWqyvaXho4BdA4sj3ZFbZeTFoF+Pc4jhfRWSlgRMzbmj75eBulX+dr
gKUo3Vhg5g/5wScoo52w3DRKPig/x3+AF/ScpCw1i7Ot3AKsXFtzDFhtFh4fERHRZRdQPgsPkLSYpEUkbTzczrXlwOHANyU9s17EeFn9wj/gSmBL4CBJ247y/NMpXVhHnOZ9FvwQ+ICkDVUsJul1kpaQtJakTWvWhynnEAPdZ28DJtXCAsBClO4m04HHJG1FOTcaj/Ulvamer32M0rXlvHEeM6KTUsCImLc1ffKxBOVD+E5Ki4gvjyHzbYx8svIj4IuS1qwnIC+StNwQP8cPKYNqPR1A0sqSthjD8w/4C/AY8BFJC0p6E6VLTkRExFyvdoHYhjKO1I2Ubg1vHeVhnwQup3S5vIvSimGm7yO2L6UMgPnD+uV/uOd/iDJG1p9qa8pxjQlheyrwPuB7wN3AtcC76+aFKWNn3QHcShm3a6+67YT6752SLq7dUj8CHF+P8zZgyniyASdTXtu7gXcCb6rjYUTMczRz6+eImNdIWg34DmVwKVNGt74YeG9tKokkA2vavrbO1PF/wFso/TMvpYzGvSLwL2BB249Jmgz8Bni37VOHee7FKYNybko5kfl/wJE9z3UEMM3253oesyblZGEScLbtN0i6vub9XR2YdC/KjCXLA38D3mh72qCfYxHKoJ071v1uBg6x/Z0RXqtNgJ/aXqXen0wphKxBaR4K8I/evBERERGzS9K+wBq239F0log2SAEjIiIiIiKihVLAiJhZupBERERERETrSHq7pAeGuF3ZdLaIaEZaYETEHCXp7cAPhth0g+0X9DvPaCR9FvjsEJv+aHvYvrgRERERETFnpYAREREREREREa031qkT5wnLL7+8J02a1HSMiIiITrnooovusL1C0znaJucVERERs2e4c4sUMHpMmjSJqVOnNh0jIiKiUyTd0HSGNsp5RURExOwZ7twig3hGREREREREROulgBERERERERERrddIAUPS4ZJul3RFz7qvSfqbpMsk/ULS0j3b9pJ0raRrJG3Rs37Luu5aSXv2rF9d0vl1/XGSFurbDxcRERERERERE66pFhhHAFsOWncGsI7tFwF/B/YCkLQ2sCPwgvqYgyXNL2l+4CBgK2BtYKe6L8BXgG/ZXgO4G9h1zv44ERER0VWSFpF0gaRLJV0p6QtD7LNwvShybb1IMqmBqBEREfO0RgoYts8B7hq07re2H6t3zwNWqcvbAcfafsT2v4BrgQ3q7Vrb19n+L3AssJ0kAZsCJ9bHHwm8YU7+PBEREdFpjwCb2l4XeDGwpaSNBu2zK3B3vTjyLcrFkoiIiOijto6B8R7g1Lq8MnBTz7Zpdd1w65cD7ukphgysH5Kk3SRNlTR1+vTpExQ/IiIiusLFA/XugvXmQbttR7koAuUiyWb1oklERET0SeumUZW0N/AYcHQ/ns/2ocChAJMnTx58shIRLbb/O7ZvOsJT7P3TE0ffKSJap3ZNvQhYAzjI9vmDdnnywontxyTdS7locseg4+wG7Aaw2mqrzenYMRf7wytf1XSEp3jVOX9oOkJEjNO6J57edISnuHT7LUbfqWpVCwxJ7wZeD7zd9kAx4WZg1Z7dVqnrhlt/J7C0pAUGrY+IiIgYku3Hbb+Yct6wgaR1ZvM4h9qebHvyCiusMKEZIyIi5nWtKWBI2hL4NLCt7Yd6Nk0BdqyDZ60OrAlcAFwIrFlnHFmIMtDnlFr4OAsYuDS7M3Byv36OiIiI6C7b91DOIwYPNv7khZN6kWQpykWTiIiI6JOmplE9BvgLsJakaZJ2Bb4HLAGcIekSSd8HsH0lcDxwFXAa8OF6leQxYHfgdOBq4Pi6L8BngI9LupbSvPOwPv54ERER0SGSVhiYvl3SosBrgb8N2m0K5aIIlIskZ/a0Fo2IiIg+aGQMDNs7DbF62CKD7f2B/YdYfwpwyhDrr6PMUhIRERExmpWAI+s4GPNRLor8WtJ+wFTbUyjnKT+pF0fuorT8jIiIiD5q3SCeEREREf1k+zJgvSHWf75n
+WHgLf3MFRERETNrzRgYERERERERERHDSQuMudSN+72w6QhPsdrnL286QkRERERERHRUWmBEREREREREROulBUZERIzJvvvu23SEIbU1V0RERERMrLTAiIiIiIiIiIjWSwuMEaz/qaOajjCki772rqYjRERERERERPRVWmBEREREREREROulgBERERERERERrZcCRkRERERERES0XgoYEREREREREdF6KWBEREREREREROulgBERERERERERrZcCRkRERERERES03gJNB4iIZn3vE79qOsJT7P6NbZqOEBERERERLZMWGBERERERERHReilgRERERERERETrpYAREREREREREa3XWAFD0uGSbpd0Rc+6ZSWdIekf9d9l6npJ+o6kayVdJuklPY/Zue7/D0k796xfX9Ll9THfkaT+/oQRERERERERMVGaHMTzCOB7wFE96/YEfm/7AEl71vufAbYC1qy3DYFDgA0lLQvsA0wGDFwkaYrtu+s+7wPOB04BtgRO7cPPFRERLXP8CRs0HeEpdnjLBU1HiErSqpTzkRUp5xOH2j5w0D6bACcD/6qrTrK9Xx9jRkREzPMaa4Fh+xzgrkGrtwOOrMtHAm/oWX+Ui/OApSWtBGwBnGH7rlq0OAPYsm5b0vZ5tk05KXkDEREREU/1GPAJ22sDGwEflrT2EPv90faL6y3Fi4iIiD5r2xgYK9q+pS7fSrkSArAycFPPftPqupHWTxti/VNI2k3SVElTp0+fPv6fICIiIjrF9i22L67L9wNXM8x5Q0RERDSnbQWMJ9WWE+7D8xxqe7LtySussMKcfrqIiIhoMUmTgPUoXVAHe5mkSyWdKukFwzw+F0YiIiLmkCbHwBjKbZJWsn1L7QZye11/M7Bqz36r1HU3A5sMWn92Xb/KEPtHzDF/eOWrmo7wFK865w9NR4iI6AxJiwM/Bz5m+75Bmy8GnmX7AUlbA7+kjM01E9uHAocCTJ48eY5fiImIiJiXtK0FxhRgYCaRnSmDZQ2sf1edjWQj4N7a1eR0YHNJy9QZSzYHTq/b7pO0UZ195F09x4qIiIiYiaQFKcWLo22fNHi77ftsP1CXTwEWlLR8n2NGRETM0xprgSHpGErrieUlTaPMJnIAcLykXYEbgB3q7qcAWwPXAg8BuwDYvkvSF4EL63772R4YGPRDlJlOFqXMPpIZSCIiIuIp6sWOw4CrbX9zmH2eAdxm25I2oFwEurOPMSMiIuZ5jRUwbO80zKbNhtjXwIeHOc7hwOFDrJ8KrDOejBERETFP2Bh4J3C5pEvqus8CqwHY/j6wPfBBSY8B/wF2rOcnERER0SdtGwMjIiIioq9snwtolH2+B3yvP4kiIiJiKG0bAyMiIiIiIiIi4ilSwIiIiIiIiIiI1ksXkoiIPrt6/zObjvAUz99706YjRERERESMKC0wIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi9jYERERMRcRdKbgFcABs61/YuGI0VERMQESAuMiIiImGtIOhj4AHA5cAXwfkkHNZsqIiIiJkJaYERERMTcZFPg+bYNIOlI4MpmI0VERMRESAuMiIiImJtcC6zWc3/Vui4iIiI6Li0wIiIiYm6yBHC1pAvq/ZcCUyVNAbC9bWPJIiIiYlxmu4BRB8galu2TZvfYEREREbPp800HiIiIiDljPC0wtqn/Ph14OXBmvf9q4M9AChgRERHRV7b/ACBpSXrOc2zf1VioiIiImBCzXcCwvQuApN8Ca9u+pd5fCThiQtJFREREzAJJuwH7AQ8DTwCiTKf67CZzRURExPhNxBgYqw4UL6rbmHnwrIiIiIh++RSwju07mg4SERERE2siChi/l3Q6cEy9/1bgdxNw3IiIiIhZ9U/goaZDRERExMQbdwHD9u51QM//qasOtf2L8R43IiIiYjbsBfxZ0vnAIwMrbX+kuUgRERExESZkGtU648iEDNop6X+B91L6q14O7AKsBBwL
LAdcBLzT9n8lLQwcBawP3Am81fb19Th7AbsCjwMfsX36ROSLiIiIVvsBZWDxyyljYERERMRcYr7ZfaCkc+u/90u6r+d2v6T7ZvOYKwMfASbbXgeYH9gR+ArwLdtrAHdTChPUf++u679V90PS2vVxLwC2BA6WNP/s/qwRERHRGQva/rjtH9s+cuA20gMkrSrpLElXSbpS0keH2EeSviPpWkmXSXrJnPsRIiIiYiizXcCw/Yr67xK2l+y5LWF7yYH9JC0zi4deAFhU0gLA04BbgE2BE+v2I4E31OXt6n3q9s0kqa4/1vYjtv8FXAtsMMs/ZERERHTNqZJ2k7SSpGUHbqM85jHgE7bXBjYCPlwvhvTaCliz3nYDDpnw5BERETGi2S5gzILfj3VH2zcDXwdupBQu7qV0GbnH9mN1t2nAynV5ZeCm+tjH6v7L9a4f4jEzqSc5UyVNnT59+lijRkRERDvtRB0Hg3IOcREwdaQH2L7F9sV1+X7gap563rAdcJSL84Cl69TxERER0ScTMgbGKDTmHUtrje2A1YF7gBMoXUDmGNuHAocCTJ482XPyuSIiImLOsr36eB4vaRKwHnD+oE3DXRzpnUoeSbtRWmiw2mpPnVV+/U8dNZ54c8RFX3vXmPa7cb8XzuEks261z18+6j4bf3fjPiSZNX/a409NR5ijvveJXzUd4Sl2/8Y2o+6z/zu270OSWbP3T08cfSfg6v3PnMNJZt3z99501H323XffOR9kFo0l0/EntLNx/w5vuaDpCHNcPwoYs1IUeA3wL9vTASSdBGxMucqxQG1lsQpwc93/ZmBVYFrtcrIUZTDPgfUDeh8TERHRCeue2L7xpy/dfoumI4xK0jrA2sAiA+tsj1o5kLQ48HPgY7ZnazyvXBiJiIiYc/rRhWRW3AhsJOlpdSyLzYCrgLOAgZLozsDJdXlKvU/dfqZt1/U7SlpY0uqU/qpzfzkqIiJiHidpH+C79fZq4KvAtmN43IKU4sXRdXa1wXJxJCIiomH9KGCMuQuJ7fMpg3FeTJn+bD7KVYzPAB+XdC1ljIvD6kMOA5ar6z8O7FmPcyVwPKX4cRrwYduPT8hPExEREW22PeUCyK22dwHWpbTQHFa9aHIYcLXtbw6z2xTgXXU2ko2Ae23fMsy+ERERMQdMSBcSSa8A1rT9Y0krAIvX2T+gnESMme19gH0Grb6OIWYRsf0w8JZhjrM/sP+sPHdERER03n9sPyHpMUlLArczc8uJoWwMvBO4XNIldd1ngdUAbH8fOAXYmjKz2UPALnMge0RERIxg3AWM2lRzMrAW8GNgQeCnlJMBbN813ueIiIiIGKOpkpYGfkiZgeQB4C8jPcD2uYzSYrR2Uf3wBGWMiIiI2TARLTDeSBmte2D6sX9LWmICjhsRERExS2x/qC5+X9JpwJK2L2syU0REREyMiRgD47/1qoQBJC02AceMiIiImGWSfj+wbPt625f1rouIiIjumogWGMdL+gFlqtP3Ae+hNNuMiIiI6AtJiwBPA5aXtAwzuoQsCazcWLCIiIiYMOMuYNj+uqTXAvdRxsH4vO0zxp0s5kkbf3fjpiM8xZ/2+FPTESIiYnTvBz4GPJMy9sVAAeM+4HsNZYqIiIgJNCGzkNSCRYoWERER0QjbBwIHStrD9nebzhMRERETb7bHwJB0v6T7ev69r/f+RIaMiIiIGKNbBwYTl/Q5SSdJeknToSIiImL8ZruAYXsJ20v2/Ltk7/2JDBkRERExRv/P9v2SXgG8BjgMOKThTBERETEBxj0LiaSNeqdNlbSEpA3He9yIiIiI2fB4/fd1wKG2fwMs1GCeiIiImCATMY3qIcADPfcfJFc6IiIiohk319nR3gqcImlhJuZ8JyIiIho2ER/osu2BO7afYIIGB42IiIiYRTsApwNb2L4HWBb4VKOJIiIiYkJMRAHjOkkfkbRgvX0UuG4CjhsRERExS2w/BNwOvKKuegz4R3OJIiIiYqJMRAHj
A8DLgZuBacCGwG4TcNyIiIiIWSJpH+AzwF511YLAT5tLFBERERNl3F09bN8O7DgBWSIiIiLG643AesDFALb/3TvYeERERHTXbBcwJH3a9lclfRfw4O22PzKuZBERERGz7r+2LckAkhZrOlBERERMjPG0wLi6/jt1IoJERERETIDj6ywkS0t6H/Ae4IcNZ4qIiIgJMNsFDNu/qosP2T6hd5ukt4wrVURERMRssP11Sa8F7gPWAj5v+4yGY0VERMQEmIjpTvcCThjDuoiIiIg5zvYZks6nnudIWtb2XQ3HioiIiHEazxgYWwFbAytL+k7PpiUpU5bN7nGXBn4ErEMZW+M9wDXAccAk4HpgB9t3SxJwYM3xEPBu2xfX4+wMfK4e9ku2j5zdTBEREdENkt4PfAF4GHgCEOV84tlN5oqIiIjxG880qv+mjH/xMHBRz20KsMU4jnsgcJrt5wHrUsba2BP4ve01gd/X+wBbAWvW227AIVCutAD7UKZ03QDYR9Iy48gUERER3fBJYB3bk2w/2/bqtkctXkg6XNLtkq4YZvsmku6VdEm9fX7Ck0dERMSIxjMGxqX1Q36LiWrdIGkp4JXAu+tz/Bf4r6TtgE3qbkcCZ1PmeN8OOMq2gfMkLS1ppbrvGQPNRSWdAWwJHDMROSMiIqK1/klplTmrjgC+Bxw1wj5/tP362QkVERER4zeuMTBsPy5pVUkL1WLDeK0OTAd+LGldSouOjwIr2r6l7nMrsGJdXhm4qefx0+q64dY/haTdKK03WG211SbgR4iIiIgG7QX8uY6B8cjAytGmd7d9jqRJczhbREREjMNEDOL5L+BPkqYADw6stP3N2czzEmAP2+dLOpAZ3UUGjvvk3O4TwfahwKEAkydPnrDjRkRERCN+AJwJXE4ZA2MivUzSpZRutJ+0feXgHXJhJCIiYs6ZiALGP+ttPmCJcR5rGjDN9vn1/omUAsZtklayfUvtInJ73X4zsGrP41ep625mRpeTgfVnjzNbREREtN+Ctj8+B457MfAs2w9I2hr4JWUMrpnkwkhERMScM+4Chu0vTESQeqxbJd0kaS3b1wCbAVfV287AAfXfk+tDpgC7SzqWMmDnvbXIcTrw5Z6BOzenNCmNiIiIuduptRXEr5i5C8m4plG1fV/P8imSDpa0vO07xnPciIiIGLtxFzAkrQB8GngBsMjAetubzuYh9wCOlrQQcB2wC6V1x/GSdgVuAHao+55CmUL1WsqAXbvU575L0heBC+t++2X+94iIiHnCTvXf3gsX455GVdIzgNtqV9YNKOcmd47nmBERETFrJqILydHAccDrgQ9QWkhMn92D2b4EmDzEps2G2NfAh4c5zuHA4bObIyIiIrrH9uqz8zhJx1C6ny4vaRplOvYF6zG/D2wPfFDSY8B/gB3reUhERET0yUQUMJazfZikj9r+A/AHSReO+qiIiIiICSJpU9tnSnrTUNttnzTS423vNMr271GmWY2IiIiGTEQB49H67y2SXkcZmXvZCThuRERExFi9ijL7yDZDbDMwYgEjIiIi2m8iChhfkrQU8Angu8CSwP9OwHEjIiIixsT2PnVxP9v/6t0maba6lURERES7zDfeA9j+te17bV9h+9W217c9ZWC7pMz+EREREf3y8yHWndj3FBERETHhJqIFxmjeAvxfH54nIiIi5lGSnkeZEW2pQeNgLEnPLGkRERHRXf0oYKgPzxERERHztrUoM6ItzczjYNwPvK+JQBERETGx+lHAyBRjERERMUfZPhk4WdLLbP9luP0k7WU7LUMjIiI6aNxjYIxBWmBEREREX4xUvKje0pcgERERMeHGXcCQ9JQpUweN9n3CeJ8jIiIiYoLkwkpERERHTUQLjF9JWnLgjqS1gV8N3Lf95Ql4joiIiIiJkK6tERERHTURBYwvU4oYi0tan9Li4h0TcNyIiIiIiZYWGBERER017kE8bf9G0oLAb4ElgDfa/vu4k0VERETMIknL
2r5r0LrVbf+r3k3X1oiIiI6a7QKGpO8yczPMpYB/ArtLwvZHxhsuIiIiYhb9StJWtu+DJ7u2Hg+sA+naGhER0WXjaYExddD9i8YTJCIiImICDHRtfR2wFnAU8PZmI0VERMREmO0Chu0jASQtBjxs+/F6f35g4YmJFxERETF26doaEREx9xr3GBjA74HXAA/U+4tSThpePgHHjoiIiBhVurZGRETM/SaigLGI7YHiBbYfkPS0CThuRERExFila2tERMRcbiIKGA9KeontiwHqVKr/mYDjRkRERIxJurZGRETM/eabgGN8DDhB0h8lnQscB+w+ngNKml/SXyX9ut5fXdL5kq6VdJykher6hev9a+v2ST3H2Kuuv0bSFuPJExEREZ3xe0p31gGLAr9rKEtERERMoHEXMGxfCDwP+CDwAeD5tsfbbPOjwNU9978CfMv2GsDdwK51/a7A3XX9t+p+A1Om7Qi8ANgSOLhegYmIiIi521O6tgKjdm2VdLik2yVdMcx2SfpOvThymaSXTGDmiIiIGIPZLmBI2rT++yZgG+C59bZNXTe7x10FeB3wo3pfwKbAiXWXI4E31OXt6n3q9s3q/tsBx9p+xPa/gGuBDWY3U0RERHTGg73FhVno2noE5aLHcLYC1qy33YBDxpExIiIiZsN4xsB4FXAmpXgxmIGTZvO43wY+TZn6DGA54B7bj9X704CV6/LKwE0Ath+TdG/df2XgvJ5j9j5mJpJ2o5yIsNpqq81m5IiIiGiJj1G6tv4bEPAM4K2jPcj2Ob1dUYewHXCUbQPnSVpa0kq2b5mAzBERETEGs13AsL1P/XeXiQoj6fXA7bYvkrTJRB13JLYPBQ4FmDx5skfZPSIiIlrM9oWSngesVVddY/vRCTj0kxdNqoGLIzMVMHJhJCIiYs6Z7QKGpI+PtN32N2fjsBsD20raGlgEWBI4EFha0gK1FcYqwM11/5uBVYFpkhagzPl+Z8/6Ab2PiYiIiLmMpE1tnzlEN9bnSsL27LYMnSW5MBIRETHnjGcQzyVGuC0+Owe0vZftVWxPogzCeabttwNnAdvX3XYGTq7LU+p96vYza9POKcCOdZaS1Sn9VS+YnUwRERHRCa+q/24zxO31E3D8XByJiIho2Hi6kHwBQNKRwEdt31PvLwN8Y0LSzfAZ4FhJXwL+ChxW1x8G/ETStcBdlKIHtq+UdDxwFfAY8OGB+eAjIiJi7jMnurYOMgXYXdKxwIbAvRn/IiIior/GM4jngBcNFC8AbN8tab3xHtT22cDZdfk6hphFxPbDwFuGefz+wP7jzRERERHtN96urZKOATYBlpc0DdgHWLA+9vvAKcDWlJnNHgLmVKEkIiIihjERBYz5JC1j+24ASctO0HEjIiIixmqJEbaNOhaF7Z1G2W7gw7MaKiIiIibORBQavgH8RdIJ9f5bSMuHiIiI6KM+d22NiIiIBoy7gGH7KElTgU3rqjfZvmq8x42IiIiYDXOka2tEREQ0b0K6etSCRYoWERER0bR0bY2IiJhL5QM9IiIi5ibp2hoRETGXSgEjIiIi5hrp2hoRETH3SgEjIiIi5irp2hoRETF3mq/pABERERERERERo0kBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9VhUwJK0q6SxJV0m6UtJH6/plJZ0h6R/132Xqekn6jqRrJV0m6SU9x9q57v8PSTs39TNFRERERERExPi1qoABPAZ8wvbawEbAhyWtDewJ/N72msDv632ArYA162034BAoBQ9gH2BDYANgn4GiR0RERMRgkraUdE29KLLnENvfLWm6pEvq7b1N5IyIiJiXtaqAYfsW2xfX5fuBq4GVge2AI+tuRwJvqMvbAUe5OA9YWtJKwBbA
Gbbvsn03cAawZf9+koiIiOgKSfMDB1EujKwN7FQvoAx2nO0X19uP+hoyIiIi2lXA6CVpErAecD6wou1b6qZbgRXr8srATT0Pm1bXDbc+IiIiYrANgGttX2f7v8CxlIskERER0SKtLGBIWhz4OfAx2/f1brNtwBP4XLtJmipp6vTp0yfqsBEREdEdY73w8eY65taJklYd6kA5r4iIiJhzWlfAkLQgpXhxtO2T6urbatcQ6r+31/U3A70nEKvUdcOtfwrbh9qebHvyCiusMHE/SERERMxNfgVMsv0iStfUI4faKecVERERc06rChiSBBwGXG37mz2bpgADM4nsDJzcs/5ddTaSjYB7a1eT04HNJS1TB+/cvK6LiIiIGGzUCx+277T9SL37I2D9PmWLiIiIaoGmAwyyMfBO4HJJl9R1nwUOAI6XtCtwA7BD3XYKsDVwLfAQsAuA7bskfRG4sO63n+27+vITRERERNdcCKwpaXVK4WJH4G29O0haqWc8rm0pA41HREREH7WqgGH7XEDDbN5siP0NfHiYYx0OHD5x6SIiImJuZPsxSbtTWmvODxxu+0pJ+wFTbU8BPiJpW8qU73cB724scERExDyqVQWMiIiIiCbYPoXSsrN33ed7lvcC9up3roiIiJihVWNgREREREREREQMJQWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9ubqAIWlLSddIulbSnk3niYiIiHYa7ZxB0sKSjqvbz5c0qYGYERER87S5toAhaX7gIGArYG1gJ0lrN5sqIiIi2maM5wy7AnfbXgP4FvCV/qaMiIiIubaAAWwAXGv7Otv/BY4Ftms4U0RERLTPWM4ZtgOOrMsnAptJUh8zRkREzPNku+kMc4Sk7YEtbb+33n8nsKHt3QfttxuwW727FnDNHIq0PHDHHDr2nNTV3NDd7F3NDd3N3tXc0N3syd1/czL7s2yvMIeOPceN5ZxB0hV1n2n1/j/rPncMOla/ziugu+/H5O6/rmbvam7obvau5obuZk/uoQ15brHAHHzCTrB9KHDonH4eSVNtT57TzzPRupobupu9q7mhu9m7mhu6mz25+6/L2bukX+cV0N3/0+Tuv65m72pu6G72ruaG7mZP7lkzN3chuRlYtef+KnVdRERERK+xnDM8uY+kBYClgDv7ki4iIiKAubuAcSGwpqTVJS0E7AhMaThTREREtM9YzhmmADvX5e2BMz239sONiIhoqbm2C4ntxyTtDpwOzA8cbvvKBiP1pTnpHNDV3NDd7F3NDd3N3tXc0N3syd1/Xc4+Rw13ziBpP2Cq7SnAYcBPJF0L3EUpcjStq/+nyd1/Xc3e1dzQ3exdzQ3dzZ7cs2CuHcQzIiIiIiIiIuYec3MXkoiIiIiIiIiYS6SAERERERERERGtlwJGRERERERERLReChhzmKRlJS3bdI6ImKFOgTiwvLikyfk9jbHI+yQiIiKiOSlgzAGSVpN0rKTpwPnABZJur+smNRwvYp4m6d3AbZL+Lmkr4DLgK8ClknZqNNwskLSGpDdLWrvpLOMhafGmMwxH0saSrpZ0paQNJZ0BXCjpJkkvazrfSCS9p2d5FUm/l3SPpD9Lem6T2SIiIiJmVwoYc8ZxwC+AZ9he0/YawErAL4Fjmww2VpJWkLSepBe1+QvGYF3MLWk+SfPV5YUkvaSLV3k79IX6E8BawBaU39XX2t4MmAzs1WSwkUg6S9LydfmdwCnAVsBxkvZoNNz4XNV0gBF8C9gBeC/wG+ALtp8DbAd8vclg
Y7B7z/I3Ke/1ZYGvAYc0kigmRG0x9kZJ20p6XtN5RlNbon5e0ntV7C3p15K+JmmZpvPNKklnNp1hLOrFtKXr8iRJ20tap+FYI6rv62Xr8gqSjpJ0uaTjJK3SdL7RSHq1pO9JOlnSSZIOkLRG07lmhaTVJb2pI39bdu85L1pD0jm1UH++pBc2nW8kkraQtOvgC9u9Fx/aqF7MWbIuLyrpC5J+JekrkpbqV44UMOaM5W0fZ/vxgRW2H7d9LLBcg7lGJWltSb8D/kJpPfJD4HJJR/TzjTmrOpz7DcAtwM2StgP+SPmCcZmkbZrMNpoOf6F+3PYdtv8FPGD7nwC2b2s412hWsH1HXf4I8DLb7wU2BN7XXKzRSfr4MLdPAG0uNC5o+3LbfwGm2z4XwPbFwKLNRpslz7V9qO0nbP+CUsiIjpH0KklTgQOAw4HdgMMknS1p1WbTjeinwGLA+sBZwDMord7+AxzRXKzRSbps0O1yYOOB+03nG46kPYE/AOdJei9wGjM+nz/eaLiR7W/7rrr8PeCvlNynAj9uLNUYSPo/4F3AecCjwD/r7QRJb2ky20gk/bJneTvgTGAb4GSVFqtt9sGe86IDgW/ZXhr4DPD9xlKNQtKXgb2BFwK/H3TOvPvQj2qNw4GH6vKBwFKUv+cP0cff0QVG3yVmw0WSDgaOBG6q61YFdqb8MW6zw4GdbV8jaQPgw7Y3lPQ+4DBg+2bjDaurufcB1qV8GboUeGn9GZ4F/Bz4VZPhRjHUF+o7JT2N8gH+3eaijejGeqKxBPA3Sd8ATgJeQykmtdWjkla2fTPwAPBgXf8IMH9zscbky5TC3GNDbGtzIb032+DWOQv1M8hsWEXSdwABK0ha0PajdduCDeaK2fdtYHPb0yWtDnzT9saSXkv5nNu80XTDe6btrSUJmGZ7k7r+j5IuaS7WmFwP3Ad8iVJwEeVCQ6svMADvBNYGnkb5GZ5d3zeLUS7yfLPBbCPp/Sxbw/Zb6/IRkj7WQJ5Z8XrbLwSQdCzwB9ufknQi5T1zQqPphvesnuXPAJva/le9QPV72l1k7P0e+/RaoMf22ZKWaCjTWGwDrGf7MUn7Aj+T9Gzb/0v5G9Nm89keOJebbPsldfncfv49b/OJY5e9C7gc+AJwer3tC1xB+VBps0VtXwNg+wJKdRDbPwRe0GSwUXQ1N7Zvra0Bbuz5GW6g/b+fj0pauS536Qv1OygnpNOAbSmtdvYCVgTe3VysUf0v8FtJ+wFXAmdK2odyZa3VV6aAi4Ff2v7C4Btwf9PhRvD/akEO278cWCnpOcBRTYUao08BFwFTgc9SW7pIegYwpcFcMfvmtz29Lt9I/eJh+wxg5WEf1bz5aleRVYHFB5pMS1qOlhcCbW9LuZhwKLCu7euBR23fUD+n2+px2/8B7qEUXu4EsP3gSA9qgbMl7Sdp0br8RihdM4B7m402qic0o/vvM6nnQbbvpt1fSt2zvEA9H6VeoHqimUhjdmJtaf1s4BeSPibpWZJ2ofyNbKsFBooAtu+hFDSWlHQCLf+bCFxRX18oY8dNBlAZW+vR4R82sWR79L1iniHpJEorkTOBNwHL2H6PpAWBK2yv1WjAYXQ491+B9W0/IWmDWnxB0vzApbZb219V0ibAQZSTu2WBl1CKda8ATrfd9jECOqd2h3ob8FzKlYdpwMm2/9ZosFFIWgu4q+fLV++2FTvQfSeicZIOp3zZOJNSfL3Z9sdrke1i263ss64yOPK3690PAR+k/BxrU8aVObShaGNWWy58EXgO5TO71eMxSDqC8kVoMUrT7scoxe5NgSVs79BcuuHVc7a9gYFxAFahXBz5FbCn7dZ+KZX0VuCrwN8p42x90PZvJK0AHGj7bY0GHIakxymvsYCFgWfZvkXSQsBU2y9qNOAoajeXD1J+NxemtHz/JfAV260sekn6NfA1238YtP5LwGdtt/YCZj0PPRD4H+AOyrn/TfX2EduX9iVHChj9JenztvdrOsdw
VAZ8+izlxOJS4ADb99c37PNtn9dkvuF0OPdLgcttPzxo/STgFbZ/2kiwMeriF2qVAVN3Bt5MuSL4OOWE45DBHyYxb6uFxPdSTqJPs/2nnm2fs/2lxsLNBkl/t50ZSDqqfrl7HzM+5w63/Xi9Wv30NrcIqL9Lqk2mFwBeTCnAtLnb3lNIWpfSXbK1/evhyanC30IpFJ0IbED5rL4ROKgDLTEGzi8WsH1n01nGqrbAeDZwbb2y3ln1vPr5dQyomED1bza1ldTgbQNdhVtNZSDP1ann/v2+EJUCRp9JutH2ak3nmBdIerrt25vOEe0i6cfADcDvKGOj3Efpn/oZSvGlrWN3DEvSobZ3azrHcLpaCJD0I0of8gso3f/+YPvjddvFPX0/W0fS/ZQvL71Nl59GuRpr20s2EizmWbWp8ZNF4zYXukci6cu2P9t0jnlF117vLr/Pa1evx23f13SWsZC0LfDbwRcB2662bnnU9Ut47SL1EuAq26c2Gm42SFrWMwbf7c9zpoAx8SQN94svylgNrR08tVa896JME7gi5QT4duBkSquGe5pLNzw9ddpRUfp/r0d5n/f1F2usagVzL8oXu1Nt/6xn28G2P9RYuFHUbjsnUcY2eKDpPGMl6bLeJpGSzrO9kaSFgUtsP7/BeMMa4j3+5CZKd6PWNmnuaiGg971Sr2geDCwP7AScZ3u9JvONpA7guTTwqYErI5L+ZXv1RoPFbOvq54WkVwHfoIzHsD7wJ2AZSn/pd9q+afhHN6v+Hg32LuoYOLY/0t9EY9Ph98rg11uUz4xWv97Q3fe5pGdSZjbajjJW0sDV/8Mps8L0bVyDWSXpP5TuL6cCx1C6Lz8+8qOaJ+lSYBPbd0v6FPBGykx+r6J02xk8aHhr9F50krQ2pbvOgpTf1bfaPr8fOVrbx6bj7gHWtL3koNsStHuWA4DjgbuBV9te1vZywKvruuMbTTayOygFi4HbVMqgZhfX5bb6MeWX/ufAjpJ+Xr9IA2zUXKwx2RB4A2VWj+NV5m9v++BDUAYffQ6ApJcA/wWw/QgzD2bVNtMp7+XB7/OpwNMbzDUWG9h+m+1vU943i0s6qb7X2zy42ZPvZ9uP1VYul1DGIGjz9K8DJ/oHAsdI+kjtOtXm93eMrqufF98GtrL9GspVxkdtbwzsT5k9pc3eSBnjqfdv76M9y23V1ffK4Nd7Kt14vaG77/OfUrqjLUXpdvRz4PmUrgEHNRlsDP4GrAmcA3wC+Lek79diUpvNXwd3BXgrsFktCmwFvK65WGPypp7lrwEfrRdGdgC+1a8QKWDMGUcx87REvX42zPq2mGT7K7ZvHVjhMkvGVxj+Z2qDTwHXANvaXr3+Mk2ry89uONtInmN7T9u/dBnt/GLK7BLLNR1sDG63vT0wiTLA1vuAmyX9WFJbp/OD8l45S9I/KB/UnwKoA239uslgo7iOUrFfvef27Ppeb/sgmF0tBEyVtGXvijqG0Y8p7/tWs30RZXpggD8AizQYJ8avq58XXZ09Bcp4I3cAWwJn2D4SuN/2kXW5rbr6Xunq6w3dfZ8vZ/tsANsnAa+0/aDtzwGvbDTZ6Gz7bts/tL0ZsC5wFXCApFa2eKnukzQwSP8dzPhsXoBufTd/5kCXF5dJCBbt1xO3titDl9Vf+uG2faafWWbDDZI+DRzZ0+x4YHrJ1v4xsP0NSccB36p/tPahG1cbF5Y0n+0nAGzvL+lmSjW5zV/soL6+ta/kT4Cf1JOjtwB7Ar9tMNuwbJ8p6VmUD+07etZPBz7dXLJRfZvSHHWoUdi/2t8os2yqpC1tnzawwvZ+kv4NHNJgrhHZfscw638E/KjPcWZL/dvyHZXp2Vrb5SXGpKufF1MlHcaM2VPOBqizp7R5ym1s3w98TNL6wNGSfkM3vmB08r3S4dcbuvs+ny7pHcBZlKvr1wNIEu1/7WdqwVkvvn6H8pnX5ouuH6C8vy+ldNOfKukc4IXAlxtNNrpnS5pCee1XkfQ0
2w/VbQv2K0TGwJiDJC04uO+YpOV7vzS1TR3AZ09mjIEBcCswhTIlUSvHkuhVB/X5LKU1yTOazjMSSV+lDED0u0HrtwS+a3vNZpKNTtI5tttenX8KSatRWo88XD+g300dPAn4oevc3BHwZF/yFWz/c9D6F9m+rKFYY9Ll7PFUXf28UIdnT+lVPy8+RJmFZMjiZlt09b3Sq0uvN3T3fV7Pib5OyX0JZdykW+oFqU1s/7zJfCORtMlA65GuURncfHNmnsXvdLd0rMEBQ3TPucj2A/Vi9/a2+9LtKAWMOaCOJvsTSpOgi4HdbF9ft7V2wLq5Sf3AeI7tK5rOEu0i6QrKmAwPSfoKZe7wXwKbAth+zwgPbyVJr63NVFuri1+mJe1AaflyO+XKwrttX1i3tfpveZezR0TMDSQt5w5NAxvRFW1vGtRVXwW2sL08cChwhqSBQZPaPGAdAJKeLemTkg6U9E1JH6hfPlqrDlK36sB92//pQvFC0oYDr62kRSV9QdKvJH1FZUaYTpL02qYzjGC+nuZurwF2sP3TWrhYv8Fc49HmAcIGvkz/Dfi5pCslvbRn8xHNpBqTzwLr234xsAulm9Qb67a2/y3vcvYYgqRlJX1e0ntV7C3p15K+VltPdo6kzk0ZOEDS5U1nGEkXz+VG0oHX+wBJy9flyZKuA86XdMMQV61bQ9ICkt4v6VRJl9XbqfX90rcuAbND0qqSjpX0R0mf7c0r6ZcNRpttbf+bKGn3nvf5GpLOkXSPpPM1Y1yPOS5jYMwZC9m+EsD2iZKuBk6S9BlaPi6DpI8A21AGfHsp8FfKfNbnSfpQi5tqfRHYU9I/KVMpndAzmFKbHU4ZdAjKjAEPAV8BNqMMFPimYR7XdocBqzUdYhg3SdrU9pmUvp6rUsZ+afXgZip9DofcBLQ6OzO+TN8iaQPKl+m9bP+Cdn+Znt/2LVAGqKqt635di6Wt/ltOt7PH0H4KXE4ptL6jLn8FeC2lELhdY8lGoDLb05CbgBf3McoskzTcZ7CA1nZRlfRR4PV07Fyuq6939Trbe9blr1GmlLxQ0nMpA/hPbi7aiH5CmT3xC5RuDFCm392Z8jfnrc3EGpPDKYOxnwfsCvxB0ja11Utrx8Do8t9E4IO2v1eXDwS+ZfsXkjYBfgBs3I8QKWDMGY9KekYdTAbbV0rajDLDwXOajTaq9wEvrv32vgmcYnsTST8ATqa9g8BdRzmpew3lj+0XJF1EKWacVAeGaqP5esZcmNzTrPtcSZc0lGlMOvyF+r3AUZL2Be4FLqmv9dLAx5uLNar/oXxpeWDQegEb9D/OLOnql+n7JT1noNtLLcC8GvgF8IJmo42qy9ljaM+0vbUkUWbZ2qSu/2PLPy8upHyRHqpYuXR/o8yy44CjGfrvVJtn9Xkv3TyX6+rrDbCApAXqOd2iA132bP9dM6awbaP1bT930LpplGLX35sINAtWsP39uryHymCk56iMhdfmc4su/03srR08vV6IwvbZkpZoIkRMnD0pA2D2TkU6rVanPtxQplmxAPA4sDB1tGrbN7a8KZnraNu/BX5bs24F7EQZnGiFJsON4ApJu9j+MXCppMm2p9aK/aOjPbhhnfxCbfsm4NWSnk8ZPOkIyof1hQMjtrfUecBDtv8weIOkaxrIMyu6+mX6gzx1lPP7VAbC26GZSGPW5ewxtPlqV5ElgMUlTbJ9fW09ttAoj23S1cD7bf9j8Aa1e6pDgMuArw/VJVXSa4bYv026eC7X5df7YOAUSQcAp0k6EDiJMr7WJU0GG8Vdkt4C/HzgHEjSfJQZ5e5uNNnoFpS0iO2HAWz/VNKtwOnAYs1GG1GX/yaeKOkIYD/gF5I+RjmX25ShZ8mbI1LAmAMGj/oMMw3ks38DkWbFj4ALJZ1P+YL6FQBJKwBtnoFk8In6o5SZU6aoTGHVVu8FDpT0Ocpc0H+pf7xuqtvarMtfqLF9NeVDBEnbtrx4ge2tRtjW9tlgOvll2valvfdr
//E1getsH91MqrHpcvYY1v9RxpIBeA/wI0mmzB7whcZSjW5fhh9zbY8+5pgdHwPuG2bbG4dZ3wZdPZf7GN18vbH9XZVBwj/AjJklnksZJPxLDUYbzY6U98fBkgYKFktTplXdsalQY/QjYENKawagfAerBZk2Ty+/Lx39m2h7b0nvprRwfw6lQLob5X3+9n7lyCwkc0Ctvn7d9h2SJgPHA09QRoJ/11Bf+NpE0guA5wNX2P7baPu3gaTn2m57U7dh1S8Xq1OnUrJ9W8OR5lrD9LE9mDJdG7ZP6m+iec+gL9OtvcIj6afAx+rf8i2AHwJ/p2T/pO0TGg04gi5nj+GpTL0n249JWoDSX/rmgS5aEQO6eC4XzRoYCywzp0TbpYAxB0i63PYL6/JZwKd7B/Kx3daBfIAnm45h+wlJCwHrANfbbnPV/inqQFUHN51jNCpzcN9n+x5JkygDPf2tC7OoDCZp2ba/TyQ9SmleeDszWgVsD5xI6YrUxWlUn/yb00Zd/TI96G/5n4G31Sb7ywO/t73uyEdoTpezx9DU4imHRyJpQ+Dq2upqUUo325cAVwFftn1vowFHoDJzzx9s31VbL3yDMn7EVcAnbE8b8QANmlvO5QZI+rzt/ZrOMRxJywK7AzdTBpfcC3g5paXnl1terO/cNOcAkpa3fUfP/XdQujBfAfzQLf6SK+nZlIH6V6V09fo75TvicC2QWknS6tS/if0slGYa1TljgXplBAYN5ENpatNakt4A3ALcLGk74I+U0ZQvk7RNk9lGIunjg26fAPYbuN90vuFI2pPS9O08Se8FTqOM3XFcm3MDSNpY0tUq02JuKOkMSpPVmyS9rOl8I3g5sChlzItdbO8C3FGXW1u8kPSmYW5vpv2js6/bc5KxD/BK26+hDLz7ueZijWo+zZh28Alq/876s7S9C2aXs8fQ/irpH5K+KGntpsPMgsMpM2xBGbV+KUqT9Ycos2212f49X/i/R5nNYyvgVFqcvavncqNoe7fan1LGXZhM6X6xEuV9/h9aPF24ujvNOZRx7wCoXbHfCVxEmZnpm02FGo3KjI8/oAxM+1LKd8OBWYI2aS7Z6NQzPW3923ImZfbKKbVrSV/kJGbO6OpAPlC+XKxL+YJ3KfBS29dIehZlqqJfNRluBF8ATgGuZMZV9fkpg5212Tsp/ZefRpnS89m2p0taDDifFv8BBr5FGb9gceA3wBtsn6syPdR36dNUSrOqtoZ6LWXE6rOA1k9vXHV5dPb5JC1ZryzM9GW6p9jbRl8AzpJ0EPAn4ASV2XdeTSk2tlmXs8fQLqN8ZuxEOVl8kNIP+Vjb1zcZbBSdnW2Lch4xYA3bA1NKHqEyeF1bdfJcTtJwV59F+VnarKuzBHV1mnOYOd+bgP+x/aCknwEXN5RpLLo64yPMPD3tZ4BNbf9roHUnfSp6tfnEsbPqQD6XUwauW5My9sWalDdlmwfyAcB1+ldJN9q+pq67YaA5Yku9gNK0czHgC7YfkrSz7TYPbAbwuO3/SPovpUp/J0D9A9xsstEtaPtyAEnTbZ8LYPvi2ky4teqAnQdKOgH4dsNxxqrLo7N38su07eMl/ZVy5W9gULaNgGNsn95ouFF0OXsMy/X3f29g7/plY0dKIeBG2y9vNt6wujzb1tmS9qMMoHq2pDfa/oXKLEqt7foCnT2Xu4dSbHnKOGBq/+wMXZ0lqKvTnAMsKmk9So+C+W0/CGUgf0mPNxttVF2cJQhmfk8sYPtf8OQFqb4Nhp8CxpxzIzAVuI3yBr2GcuLY9g9rJM1Xv+C9p2fd/LT4D7DtG4G31OZMZ0j6VtOZxujiWilejFK5PFLSaZTWOlc1mmx0vSdBew3a1tr3Si/b/6bFs2AM8jG6Ozp7Z79Mu0xz9pmmc8yOLmePIQ2eyecC4ILaZbLNMxF1ebat3SkFo4GZtf63tnz5FaU1TGt18VwOOIpyhXeogcx/1ucss2qoWYKgDKTa5otpXZ3mHEo3qYGW
yndJWqnmXw54bITHNa2rswQBrFtbSglYuOc1X4iZW6zNURnEcw6Q9FHgdcA5wNaUPpP3UL5kfMj22Y2FG0Xt+3a565zKPesnAa+w/dNGgs2C2v1iX2BDt3x6ydp8/i2UiuaJlOmgdqIUwA4aqCa3kaRtgd/ZfmjQ+ucAb7bdyimsJD2D0rz2CeDzlCmr3kQ58fioM5p/VCpTMO9O+f38LvBW4M2U98p+th9oMN6Iupw9hibpbbbb/iVuWOr4bFuSlqJccWz9DA1zw7lcF6mDswRJWhd4qBa8e9cvCOzgDk67Xf8fFh58ftommstmCZK0NPB823/py/OlgDHxaveRgb5NT2NG36bVgJNtt7lvUzRM0nJdOEHqqtrC5TeUVi9vo4wr8TPgDcBrbG/XXLrhDfGFdEdmFF5a/YW0q1+mJR1PuUq8KLAWZTT544BtgWfYbu0V2C5nj7mfOjBjFUC9qvjowGwG9cr0Sygj7p/aaLhZJOkltts8LsCIJD2v7V/0apFrS2Dluupm4HTb9zQWajZI2tb2lKZzjJWkyfTM5tH29wl0fwZCSSvS8z7vd0E6BYw5oBYwJtt+pPaHO8N16lRJV9hep9mEw6tXSPYCVgFO7b3aI+lg2x9qLNwIOpz7AMq4BnfUP8DHU/4ALwS8y/YfGg04gpr3a5QP6L0oI81vQJkK6n22L2ku3fAk/XWgiFj7Bq/Ws+0S2y9uLNwIuvyFtKvZB94PKu2AbwFWsu16/1LbL2o44rC6nD2GJul5lMGTnwA+Avw/SuH178DOtq9uLt3wJH3O9pfq8trALyljgwl4q+3zG4w3IkmXApvYvlvSpygtaU8BXgVMtT24+2QrqAymPdMqyjhs21DO/TtXyBj8ed02kt5Fad35W8p5EZRz0tdSxmY7qqlsI5H0psGrgIOADwHYPqnvocZI0qso49/dQ5nV7E/AMpSxdd5pu5XjpqjMQPh+4BHg68AnKdk3Ag6z3doB/OuYI4dQZpPqfZ/fA3zQ9l/7kSNjYMwZXe7b9GPgH5RRqt+jMkXj22w/QvnFaquu5n6d7T3r8tcoJ3MX1sHNfkapyLbVwZQP66WBPwP/a/u1kjaj/HFr61SqvWN3DD6h6Fv/vdnwXNs79HwhfU39QnouZZT5NutydmrWUwauwtb7naj+dzl7PMWhlM+JxSlT130G2AV4PWWKz82aizaiNzFjAPOvUbrqnaoyCOm3KVNbt9X8tu+uy2+lzHLwn3rx4WKeOv5TW0wFzqN8QRqwHGW8AFPG2WodSd8ZbhPlXKPN9qbM5nFP78p6IfN8nnq+0RbHAacDtzNjnJ3FKMUuU2ZRbKtvA5u7zN63OvBN2xurzDR3GLB5o+mG1+UZCH8MvH9w4VnSRpQZSNbtR4g2j0TcWbYPpIxjcDplaskf1/XT2z4mA/Ac23va/qXtbSkf0GfWAXHarKu5F9CMaSQXtX0hgO2/U0YmbrMFbZ9q+xjK96ITKQu/p93Tep4saWDE588NrJS0BjMGamut+kV0pi+ktH+kcKCT2af2vFd6B8J7DnB/Y6nGpsvZY2hL2P5V/Zv7qO1jXfyKctWxC5450PXCZRDSVs9YBdwnaaDV7B3M+GxbgHafQ7+FchX6q7ZfbfvVwK11uZXFi2oX4ArgokG3qcB/G8w1FmLoz7MnaPd0pC+n/B5eaHsX27sAd9Tl94zy2KbNb3t6Xb6ROsWn7TOY0b2hjR63/R9Kq4WZZiBsMtQYLTZUqznb51EKX32RFhhziO0rgSubzjEbFu4ZuRrb+0u6mTIg6eLNRhtRV3MfDJxSr+acJulASrV7U+CSJoONwcOSNqc0I7OkN9j+ZW3S19rpq2x/XtLzJK0MnO86/oLtayX9qOF4I5kqaXHbD3TwC2kns9t+r6QNJLm2jFqb0r/5GkrrutbqcvYYVm8LscFX6No8s8SzVaZNFrCKpKd5xuB6bZ8y8APA0bUrye2Uv2XnAC8EvtxoshHY/rmk04Ev
SnoP8AnaXSwecCFlUMM/D94gad/+x5kl+1NmlvstpcskwGqULiRfbCzVKOrnw2uBPSSdRWnZ1YX3CpTfx8MoLdK2Bc6GJ8fdanOL2i7PQHiqpN9QWhQNvM9XBd4FnNavEBkDI2Yi6avAb23/btD6LYHv2l6zmWQj62puAEmbAB9kxvSSN1H6CB9uu7XTQKmMXP1VytWF/6X8DDtT+sS9b6gTkDaQtAdlQMmrKSOEf9T2yXXbxbYH9x1ujdrkeqgvpE+2amirLmaXtA+wFeX38gzKLEFnUU5IT7e9f4PxRtTl7DE0Se8HjvagQW9r67HdbX+skWCjqEXtXhfZfqAOAre97YOayDVWKjMabM6Mz+hpdGhgxtpn/ZvAOrZXaDrPSCQtCzzsFs8eMZLaXWQLnjqI593DP6o96oWdb1HG8Xt203lGozJTyvso3TEupZw3Py5pUeDptm9oNOAw9NQZCDegDCrf+hkIASRtBWzHzO/zKbZP6VuGlp43RgtJ2mWgO0yXJHf/tTm7yiC7L6sn0JMoHx4/sX2gegb4bJsufyHtavb6XnkxpTvXrcAqtu+rJ0fnu8UDYXY5e0RMrDr+0BK272s6S0TEeLW5/160zxeaDjCbkrv/2px9vp5uI9cDmwBbSfom7e6nuj2wMfBK4MOU8XW+SLna89Ymg41BV7M/ZvvxejXwnwMn/7Xv6hPNRhtVl7PHECS9sV6hRtIKko6SdLmk4ySt0nS+4Uh6Uc/ygpI+J2mKpC/Xpt6tJWlJSQdI+omknQZtO7ipXLOqtnKb2nSO0UhaXNJ+kq6UdK+k6ZLOk/TuprONRy0ot5Kk+SS9R9JvJF0q6WJJx9bWwa1Wfz//r/5+vm3Qttb+ftbX+HO1G+1cQ9Kh/XqujIERM5F02XCbgBX7mWVWJHf/dTj7bZJe7DrNa22J8XrKNLAvbDTZyB6z/TjwkKSZvpBKavsX0q5m/29Pf/31B1ZKWor2FwG6nD2Gtr/ttevy9yizTHwWeA1lZPjXNhVsFEcAA13zDqDMhvENyhSw36f0nW6rwTOcbU8HZjiTdD8zxjEYKMw/bWC97SWbSTaqo4FfUIrbO1DGCDgW+Jyk59r+bJPhRqKnTkf65CbgGf3MMosOA24A/o9yseE+4I+U1/yFtr/bZLhRdHUGwmUos+qcJelW4BjgONv/bjTVGAwU0YfaBGzdtxzpQhK9JN1G+eAY3F9PwJ9tP7P/qUaX3P3X1ez1SuVjtm8dYtvGtv/UQKxRqUzL/GrbD6lnwNr6hfSslo/d0cnskhauJ0KD1y8PrGS7zVfVOps9hibpGttr1eWLbPcWpi6x/eLGwo2gt2uepEuAl9p+tHZruLTN3ZkGv66S9qacpG8LnNHiv13foXxB+pTt2+q6f9levdFgo5B0qe11e+5faPulkuYDrrL9vAbjjUjSo5QCzFBfrLa3vUSfI42JpMt6fwclnWd7I0kLA5fYfn6D8UbU4d/PJ8dbk/Q/lJkr30QZm+0Y231ryTCrJD1OKXj1tlh2vb+y7b4MKJ0WGDHYr4HFB65O95J0dt/TjF1y918ns9ueNsK2VhYvqlcOfCEdKABUC1IGT22zTmYfqgBQ199BmVKxtbqcPYZ1tqT9KFdKz5b0Rtu/kPRq4N6Gs41kKUlvpHRbXtj2o1CaAUhq+1W0Ts5wZvsjktYHjpH0S0qLnba/1gAPSnqF7XMlbQvcBeVzoxa82uwy4Ou2rxi8QdJrGsgzVo9Keo7tf0p6CXW6WtuP5PdzzrP9R+CPKgPMv5bSrba1BQzgOmAz2zcO3iDppiH2nyPSAiMiIiKi5VRG3N8bGJiKeBXgQeBXwJ5DnVC2gaTBAzrvafs2Sc+gzKqyWRO5xkIdnuEMyvgGlFm33gI8p60tIweojJfyI2BN4ErgPbb/LmkFYCfb32k04AjqlfQbhvliN9l2K8cgkbQppZvXfylTj+5o+/z6mn/K9qebzDeSrv5+SjrW9o5N55gdkj4M
nGv70iG27dGvLkcpYERERER0SO1+tYDtO5vOEu0naSVgPfdxmsPojtq6ZbnaQi+i9dKFJCIiIqIDJL0SuM32NZI2lvQy4Grbv2k620gkbUDpNXKhpLWBLYG/de0LtaRXABsAV9j+bdN5RiLpecB2wMp11c11HIyrG4w129Ti6dmhzBIE/MH2XbX1wjeA9YCrgE+M1H21BdYCtpP05HsFmNL290odUHJ34N+UwUg/C7yMMpbEl20PHqOtNSRtQRnIuPc1P9n2aY2Fmk2SjrLd18GY0wIjIvpO0p9tv3wW9t8E+KTt18+xUBERLSbp25QvzwsApwObAacCrwL+avtTzaUbnqR9gK0ouc8ANgTOovT3Pt32/g3GG5GkC2xvUJffR5kG+hfA5sCvbB/QZL7hSPoMZWDAY4GBL86rADsCx7Y190gk3Wh7taZzDEfSVQOzBEk6jjJL0AmUWYLebruVswR1+b0i6RTgcmBJ4Pl1+XjK35Z1bW/XYLxh1b/lzwWOYubX/F3AP2x/tKFoo5I0ZfAq4NXAmQC2t+1LjhQwIqLtUsCIiHmdpCuBdYBFKVfrVq4z+yxIKWCs02jAYUi6HHgxsDBwK7CK7fskLQqc3/JZSHpnULkQ2Nr2dEmLAefZbuXU25L+DrxgYMDUnvULAVe2eGyAkaZnf67thfuZZ1Z0eJagTr5XYMbrWrvATLO98uBtzaUbnqS/237uEOsF/L3lr/nFlFZFP2LG7CPHUApe2P5DP3LM148niYjoJemB+u8mks6WdKKkv0k6emCkcUlb1nUXU6aXGnjsYpIOl3SBpL9K2q6uP1DS5+vyFpLOqQOYRUTMDexy1WlgJp+BK1BP0O7zucdsP277IeCftu8DsP0fZvwsbTWfpGUkLUe56DcdwPaDwGPNRhvRE8BQA3auRLtf8xUpV6G3GeLW9vFezpa0Xy3MnV27lNCBWYK6+l6B+vsJrAosLmkSQP197ct0nrPpYUkvHWL9S4GH+x1mFk0GLqIMKH2v7bOB/9j+Q7+KF5AxMCKieesBL6D0YfwTsLGkqcAPgU2Ba4HjevbfGzjT9nskLQ1cIOl3wF7AhZL+CHyHcqWs7R++ERFj9Zv6920RytWv4yWdR+lCck6jyUb2X0lPqwWM3qvSS9H+L0hLUU7WBVjSSrZvkbR4XddWHwN+L+kfwMDUhqsBa1DGDGirTk7PXu1OOT+5pt7/X0kDswS9s7FUo/sY3XyvQJlS+m91+T3Aj+rUr2sDX2gs1ejeDRwiaQlmdCFZlVLoendDmcaknld/S9IJ9d/baKCekC4kEdF3kh6wvXjtGrL3QN9QSYdQihhXAN+x/cq6fltgN9uvr8WNRZhx9WtZYAvbV0t6OeVE/n/7NZVTRES/1EE7bfs8Sc8B3gjcCJzY1oKtpIVtPzLE+uWBlWxf3kCscZH0NGBF2/9qOstwagvEDZh5kMALbT/eXKp5Q9dmCerye0XS/JTvs49JWoDSXe1m27c0m2x0KlNJP/ma2761yTyzQ9LrgI1tf7afz5sWGBHRtN4T28cZ/e+SgDfbvmaIbS+kNDNt9Vz3ERGzw/Zfepb/Kelw23c1mWk0wxQvlq1TNnZy2sY69sj0pnOMpBa0zms6x7xC0otsXwZgu81dRp5i8HtF0odsd+K9M6jIsgilVdd/GoozJnV8kUdrweLW2s1oE0lXdmEWEkmrAffZvge4ElhU0jq2r+hXhjb3mYyIedffgEn1CiOUEbIHnA7s0TNWxsAAa88CPkHpkrKVpA37mDciYo5SmTb1aklXStpQ0hmUbnM31ZYZrSTpcz3La9dBAy+SdH3H/05f1XSA4Uh6kaTz6nvj0DpOwMC2C5rMNpKu5q7+Kukfkr6oMlVwJ0j6+OAbsF/PcmtJOrhn+RWU38lvAJdL2rqxYKO7EFgaQNKngP0pgzN/QtL/NZhrVJL2BP4AnCfpvcBplFmmjuvn+yUtMCKidWw/LGk3Sp/vh4A/AkvUzV8E
vg1cVps9/kvSNpQ5wD9p+9+SdgWOkPRS220fECkiYiy+BewALA78BniD7XMlvQT4LrBxk+FG8CbgS3X5a8BHbZ8qaQPK3/IxT6ndbyOckIvy/9BWBwP7Uq6qvxc4V9K2tv8JLNhksFF0NTfAZZSxLnYCptTxL46hTEV6fZPBRvEF4BTKlfSBcV3mZ8Y5V5tt1LP8RcrfxIslPZsyneopzcQa1fy2767LbwX+x/Z/JB0AXEwZ062t3kkZY+RpwPXAs3tmZjof+GY/QqSAERF9Z3vx+u/ZwNk963fvWT4NeN4Qj/0P8P4hDvuann0uonQniYiYWyw4MF6EpOm2zwWoJ+yLNhttzJ5p+1QA2xd0IPeXKUWXoWYcaXMr5iV6mqJ/XdJFwGmS3smM2WvaqKu5oYxNcwVlIM+9a4FuR0oR5kbbbS3UvYDSamEx4Au1e9TOtts8COZQlrR9MYDt69TuWeju6+lycQel68t/KN/L25wb4PFabPkvJfOdUGZmqg2j+yIFjIiIiIj26z2xHXyFrs1TBj5b0hTK1d1VemYkgfZfVb8Y+GUtis+kNp9uLUlLDYzFYPssSW8Gfk4Z+Lq1upqbQbPS2L6AMkvaJ4BXNhNpdLZvBN6iMiX9GZK+1XSmWfA8SZdRXvtJkpaxfXctXrT5b+IHgKMlXQrcDkyVdA7lwtuXG002uosl/YxS8Po9cKSk0yizBvatW11mIYmIiIhouTob0+96vvwPrH8OZWDjrzaTbGSSXjVo1cW275e0IrC97YOayDUWktYC7rL9lAE7Ja1o+7YGYo1K0tuA6wYPxFgH3/t/tt/XTLKRdTU3lOy2f9Z0jvGo3QD2BTYcmAWuzerYZ71usf3fOsPRK22f1ESusaizp2wOPJfSoGAacHodGLO16kwvb6G0iDoR2JDSbepG4CDbD/YlRwoYERERERERIWm5rkwBO1iXs8fYtb2fTURERMQ8T9JSkg6Q9DdJd0m6s85KcoCkpZvON5yu5gaQ9AxJh0g6SNJykvaVdLmk4yWt1HS+4SR3/0nasmd5KUmHSbpM0s9qa6NWqr+Hy9flyZKuo8wwccMQradaZZjs57c9u6SLJX1OM2ba6wxJi0vaT2U2rHslTVeZOWjnfuZIASMiIiKi/Y4H7gY2sb2s7eWAV9d1xzeabGRdzQ1wBKVf903AWZRB67amzIz1/eZijeoIkrvfescu+AZwC7ANZcrMHzSSaGxeZ/uOuvw14K221wReS/k52myo7GvQ/uzLUKZRPUvSBZL+V9IzG840VkcD1wFbUGaw+Q5lZpJNJfVt/I50IYmIiIhoOUnX2F5rVrc1rau5AST91fZ6dflG26v1bLvE9osbCzeC5O4/SRfbfkldnilrm7NLuhp4oe3HJJ1ne6OebZfbbu2Mbl3NPui98j+UMSTeBFwNHGP70CbzjUTSpbbX7bl/oe2X1oFTr7L9lNkD54S0wIiIiIhovxskfbq3ObqkFSV9hnLFuq26mhtmPk8+aoRtbZPc/fd0SR9XmXVkSWmmOSXbnP1g4BRJm1KmrD1Q0qskfQG4pNloo+pydgBs/9H2h4CVga8AL2s40mgelPQKeHJg6bsAbD/BoJl45qRMoxoRERHRfm8F9gT+IOnpdd1twBRgh8ZSja6ruQFOlrS47Qdsf25gpaQ1gL83mGs0yd1/PwSWqMtHAssD0yU9gxZ/mbb9XUmXAx9kxowYawK/BL7UYLRRdTj7U97Lth8HTqu3Nvsg8ENJawJXAu8BkLQC0LcZpdKFJCIiIiIiIiJar81NmiIiIiICkPQRSas0nWNWdTU3dDd7cvdfV7N3NTd0N3vNvWrTOWZHW17ztMCIiIiIaDlJ9wIPAv8EjgFOsD292VSj62pu6G725O6/rmbvam7obvau5ob2ZE8LjIiIiIj2uw5YBfgisD5wlaTTJO0saYmRH9qoruaG7mZP7v7ravau5obuZu9qbmhJ9rTAiIiIiGi53qn36v0Fga0o
U/C9xvYKjYUbQVdzQ3ezJ3f/dTV7V3NDd7N3NTe0J3sKGBEREREtJ+mvttcbZtvTbD/U70xj0dXc0N3syd1/Xc3e1dzQ3exdzQ3tyZ4CRkRERETLSXqu7bZPJfkUXc0N3c2e3P3X1exdzQ3dzd7V3NCe7ClgRERERHSAJAEbACvXVTcDF7jlJ3NdzQ3dzZ7c/dfV7F3NDd3N3tXc0I7sKWBEREREtJykzYGDgX9QThihDKa2BvAh279tKttIupobups9ufuvq9m7mhu6m72ruaE92VPAiIiIiGg5SVcDW9m+ftD61YFTbD+/kWCj6Gpu6G725O6/rmbvam7obvau5ob2ZM80qhERERHttwAwbYj1NwML9jnLrOhqbuhu9uTuv65m72pu6G72ruaGlmRfoF9PFBERERGz7XDgQknHAjfVdasCOwKHNZZqdF3NDd3Nntz919XsXc0N3c3e1dzQkuzpQhIRERHRAZKeD2zHzIOnTbF9VXOpRtfV3NDd7Mndf13N3tXc0N3sXc0N7cieAkZEREREREREtF7GwIiIiIhoOUlb9iwvJelHki6T9DNJKzaZbSRdzQ3dzZ7c/dfV7F3NDd3N3tXc0J7sKWBEREREtN+Xe5a/AdwKbANcCPygkURj09Xc0N3syd1/Xc3e1dzQ3exdzQ0tyZ4uJBEREREtJ+li2y+py5fYfnHPtpnut0lXc0N3syd3/3U1e1dzQ3ezdzU3tCd7ZiGJiIiIaL+nS/o4IGBJSfKMq1BtblHb1dzQ3ezJ3X9dzd7V3NDd7F3NDS3J3vYXKSIiIiLgh8ASwOLAkcDyAJKeAVzSXKxRdTU3dDd7cvdfV7N3NTd0N3tXc0NLsqcLSURERESHSdrF9o+bzjGrupobups9ufuvq9m7mhu6m72ruaG/2VPAiIiIiOgwSTfaXq3pHLOqq7mhu9mTu/+6mr2ruaG72buaG/qbPWNgRERERLScpMuG2wS0duq9ruaG7mZP7v7ravau5obuZu9qbmhP9hQwIiIiItpvRWAL4O5B6wX8uf9xxqyruaG72ZO7/7qavau5obvZu5obWpI9BYyIiIiI9vs1sLjtSwZvkHR239OMXVdzQ3ezJ3f/dTV7V3NDd7N3NTe0JHvGwIiIiIiIiIiI1ss0qhERERERERHReilgRERERERERETrpYARERERERGdJmmWBhGUtImkX8+pPBExZ6SAERERERERnWb75U1niIg5LwWMiIiIiIjoNEkP1H83kXS2pBMl/U3S0ZJUt21Z110MvKnnsYtJOlzSBZL+Kmm7uv5ASZ+vy1tIOkdSvj9FNCjTqEZERERExNxkPeAFwL+BPwEbS5oK/BDYFLgWOK5n/72BM22/R9LSwAWSfgfsBVwo6Y/Ad4CtbT/Rvx8jIgZLBTEiIiIiIuYmF9ieVosNlwCTgOcB/7L9D9sGftqz/+bAnpIuAc4GFgFWs/0Q8D7gDOB7tv/Zt58gIoaUFhgRERERETE3eaRn+XFG/84j4M22rxli2wuBO4FnTlC2iBiHtMCIiIiIiIi53d+ASZKeU+/v1LPtdGCPnrEy1qv/Pgv4BKVLylaSNuxj3ogYQgoYERERERExV7P9MLAb8Js6iOftPZu/CCwIXCbpSuCLtZhxGPBJ2/8GdgV+JGmRPkePiB4qXcAiIiIiIiIiItorLTAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiIiIiovVSwIiIiIiIiIiI1ksBIyIiIiIiIiJaLwWMiIiIiIiIiGi9FDAiIiIiImKWSXq3pHN77j8g6dmjPGaSJEtaYJzPfb2k14znGPU4o2aOiPZIASNiHtHkScZEyUlGREREe9le3PZ1TeeYFROVWdJrJZ0l6X5Jd0q6RNJnJC0yETkjokgBI2IeNS+f
ZDRJ0iaSpjWdIyIiIiaGpLcAJwI/A55lezngrcAqwKrDPKYVF4ciuiYFjIiYK0iav+kMERERcytJq0o6SdL02sLge0PsY0lr1OVFJX1D0g2S7pV0rqRFh3jMm2t3kHVGef531mPdKWnvQdvmk7SnpH/W7cdLWrZuO1XS7oP2v1TSm2Yls6SNJP1Z0j318ZvU9QK+Cexn+4e27wKwfY3tPWz/o+63r6QTJf1U0n3AuyU9U9IUSXdJulbS+3oyHiHpSz33Z7oAUl+zvSRdJeluST9Oa4+YF6SAETEXasFJxpAf8nXb2ZK+KOlPtZnlbyUtX7fNyknGEZIOkXSKpAeBV0t6fj3+PZKulLRtz3GOkHSQpN/U5z1f0nMGvR4fkvSPuv2Lkp5Tf4776snQQj37v16leeg9dZ8X9Wy7XtInJV1WX8/jJC0iaTHgVOCZKt1hHpD0zJFey4iIiKapXCT4NXADMAlYGTh2lId9HVgfeDmwLPBp4IlBx90F+ArwGttXjPD8awOHAO8EngksR2ndMGAP4A3Aq+r2u4GD6rZjgJ0GHetZwG/GmlnSynX/L9X1nwR+LmkFYK2a5ecjvBYDtqO01FgaOJryGk6rmbcHvixp0zEcZ8DbgS2A5wDPBT43C4+N6KQUMAaRdLik2yUN+0d00P471MrnlZJ+NqfzRYymBScZI33ID3gbsAvwdGChug/M2knGwHH2B5YAzgd+Bfy2HncP4GhJa/XsvyPwBWAZ4Nr62F5b1Ndho/oaHAq8g9L8c52BbJLWAw4H3k85ifoBMEXSwj3H2gHYElgdeBHwbtsPAlsB/67dYRa3/e9hfraIiIi22IDyJftTth+0/bDtc4fbWdJ8wHuAj9q+2fbjtv9s+5Ge3T4GfArYxPa1ozz/9sCvbZ9Tj/H/mPk85QPA3ran1e37AturdNP4BfBiSc+q+74dOGlQltEyvwM4xfYptp+wfQYwFdgaWL4e4taeYx1bL3A8JOmdPU/zF9u/tP1EfdzGwGfq63kJ8CPgXaO8Fr2+Z/um2upjf3rOoSLmVilgPNURlC8do5K0JrAXsLHtF1D+EEc0remTjJE+5Af82Pbfbf8HOB54cV0/ppOMHifb/lM9EXgxsDhwgO3/2j6TUsjp/TD/he0LbD9GufLx4kHH+6rt+2xfCVwB/Nb2dbbvpbScWK/utxvwA9vn19frSOARSuFjwHds/7ueVPxqiOeKiIjoilWBG+rn51gsDywC/HOEfT4FHGR7LONCPRO4aeBOvSBwZ8/2ZwG/qEWDe4CrgceBFW3fT7kQsmPddyfKOcCsZH4W8JaB49fneAWwUk+OlXry7Wh7aeBioLeL6009y88E7qr5BtxAufA0Vr3Hu6EeM2KulgLGILbPAe7qXVebkZ8m6SJJf5T0vLrpfZQ/vHfXx97e57gRQ2n6JGOkD/kBt/YsP0QpPDALJxkDBp8I3FSLGQMGnwgM+bw9butZ/s8Q9wf2fxbwiUE/46rMfOIw2nNFRER0xU3Aahr7wJN3AA9TujYMZ3Pgc5LePIbj3ULPYJiSnkZpAdmbbyvbS/fcFrF9c91+DLCTpJdRznnOmsXMNwE/GXT8xWwfAFwD3Ay8aQw/h3uW/w0sK2mJnnWr1WMBPAg8rWfbM4Y4Xu8AoavVY0bM1VLAGJtDgT1sr09p6n5wXf9c4Lm1L/95ksbUciNiDmv6JGOkD/mxGMtJxoDBJwKr1hYlA3pPBCbSTcD+g37Gp9k+ZgyP9ei7REREtMoFlCLCAZIWq+M6bTzczvViwuHAN1UGqpxf0ssGdbW8ktLq+aDeMauGcSLwekmvqONR7cfM32O+D+w/0IJT0gqStuvZfgrl4sN+wHGDLnaMJfNPgW0kbVHXL6IyqOYq9XGfAPaR9D5Jy6hYE1hxhNfoJuDPwP/V470I2LU+F8AlwNaSlpX0DIZu6f1hSauoDFi6N3DcSC9ixNwgBYxRSFqcMi7ACZIuofR1H7iSvACwJrAJ
5UrxDyUt3f+UETNp+iRj2A/5MeYf9SRjGOdTWjp8WtKCKgOHbsPo43/Mjh8CH5C0YT1JWUzS6wZdRRnObcBykpaaA7kiIiImnO3HKZ+pawA3UgaefOsoD/skcDlwIaV181cY9N3D9qXA6ynn0FuN8PxXAh+mTFN6C2WQzt5WoQcCU4DfSrofOA/YsOfxjwAnAa+px5ilzLXYsB3wWWA65ULGpwZ+HtvHUca+ekfddgeli+yhwAkjPN9OlPHK/k3pRruP7d/VbT8BLgWup4zvNVRx4md123WUlrRfGmKfiLlK5h8e3XzAPbZfPMS2acD5th8F/iXp75SCxoV9zBcxE9uPS9oG+A7lJMOUD7iLR3jYJ4H/o7x3F6d8YG4x6LiXSno98BtJj9o+dZjnv6le9fgqpTXF45SiygfHmP8RSSdRxuX47FgeUx/33/pzH0wZm+Zm4F22/zbWY8zCc01Vmerse5Tf+f8A5wLnjOGxf5N0DHBdHXB17QzkGRERbWf7RspMH4Md0bOPepb/Q2k18LFB+18P9O43lRFaKvTsdyRwZM+q/Xu2PUGZyvSbIzx+V0oLh8Hrx5IZ2+dTZjkZ7vinAaeNsH3fIdZNoxRwhtr/YZ5aJPrWoPsX2v6/4Z4zYm4kO62ZB5M0iTLS8Tr1/p+Bb9k+QZKAF9Uvc1sCO9neWWUayL8CL7Z957AHj4iIiIiIGAdJ1wPv7WmxETFPSBeSQeqV0b8Aa0maJmlXykwIu0q6lNKUfqBP3enAnZKuovTT/1SKFxERERERs0bS2yU9MMTtyqazRUR7pAVGRMwySW+njAcz2A11SuGIiIiIiIgJlQJGRERERERERLReBvHssfzyy3vSpElNx4iIiOiUiy666A7bK/TjuSStxcyj8T8b+DxwVF0/iTJI4A62765jVx0IbE2Zqejdti+ux9oZ+Fw9zpfqIIFIWp8yMOGilJmRPmrbdarCpzzHcFlzXhERETF7hju3SAuMHpMnT/bUqVObjhEREdEpki6yPbmB552fMuPQhpQpFu+yfYCkPYFlbH9G0tbAHpQCxobAgbY3rMWIqcBkymxNFwHr16LHBcBHKNMznwJ8x/apkr461HMMly/nFREREbNnuHOLDOIZERERXbUZ8E/bN1AG2B6YYvFIZkz3uB1wlIvzgKUlrUSZKvoM23fVVhRnAFvWbUvaPs/lKs9Rg4411HNEREREH6SAEREREV21I3BMXV7R9i11+VZgxbq8MnBTz2Om1XUjrZ82xPqRnuNJknaTNFXS1OnTp8/WDxURERFDSwEjIiIiOkfSQsC2wAmDt9WWE3O0j+xwz2H7UNuTbU9eYYW+DAsSERExz0gBIyIiIrpoK+Bi27fV+7fV7h/Uf2+v628GVu153Cp13UjrVxli/UjPEREREX2QAkZERER00U7M6D4CMAXYuS7vDJzcs/5dKjYC7q3dQE4HNpe0jKRlgM2B0+u2+yRtVGcwedegYw31HBEREdEHmUY1IiIiOkXSYsBrgff3rD4AOF7SrsANwA51/SmUGUiupUyjuguA7bskfRG4sO63n+276vKHmDGN6qn1NtJzRERERB+kgBEREUPa+LsbNx0h+uhPe/yp6QhjZvtBYLlB6+6kzEoyeF9Tplgd6jiHA4cPsX4qsM4Q64d8jvFa/1NHTfQh50oXfe1dTUeIiIiGpQtJRERERERERLReChgRERERERER0XopYERERERERERE66WAERERERERERGtlwJGRERERERERLReChgRERERERER0XopYERERERERERE66WAERERERERERGtlwJGRERERERERLReJwsYkg6XdLukK4bZLknfkXStpMskvaTfGSMiIiIiIiJi4nSygAEcAWw5wvatgDXrbTfgkD5kioiIiIiIiIg5pJMFDNvnAHeNsMt2wFEuzgOWlrRSf9JFRERERERExETrZAFjDFYGbuq5P62uewpJu0maKmnq9OnT+xIuIiIiIiIi
ImbN3FrAGDPbh9qebHvyCius0HSciIiIiIiIiBjC3FrAuBlYtef+KnVdRERERERERHTQ3FrAmAK8q85GshFwr+1bmg4VERER4ydpaUknSvqbpKslvUzSspLOkPSP+u8ydd9hZyaTtHPd/x+Sdu5Zv76ky+tjviNJdf2QzxERERH90ckChqRjgL8Aa0maJmlXSR+Q9IG6yynAdcC1wA+BDzUUNSIiIibegcBptp8HrAtcDewJ/N72msDv630YZmYyScsC+wAbAhsA+/QUJA4B3tfzuIGZz4Z7joiIiOiDBZoOMDts7zTKdgMf7lOciIiI6BNJSwGvBN4NYPu/wH8lbQdsUnc7Ejgb+Aw9M5MB59XWGyvVfc+wfVc97hnAlpLOBpass5gh6SjgDcCp9VhDPUdERET0QSdbYERERMQ8a3VgOvBjSX+V9CNJiwEr9nQXvRVYsS4PNzPZSOunDbGeEZ7jSZndLCIiYs5JASMiIiK6ZAHgJcAhttcDHmRQV47a2sJzMsRwz5HZzSIiIuacFDAiIiKiS6YB02yfX++fSClo3Fa7hlD/vb1uH25mspHWrzLEekZ4joiIiOiDFDAiIiKiM2zfCtwkaa26ajPgKsoMZAMziewMnFyXh5uZ7HRgc0nL1ME7NwdOr9vuk7RRnX3kXYOONdRzRERERB90chDPiIiImKftARwtaSHKrGO7UC7KHC9pV+AGYIe67ynA1pSZyR6q+2L7LklfBC6s++03MKAnZfayI4BFKYN3nlrXHzDMc0REREQfpIARERERnWL7EmDyEJs2G2LfYWcms304cPgQ66cC6wyx/s6hniMiIiL6I11IIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi8FjIiIiIiIiIhovRQwIiIiIiIiIqL1UsCIiIiIiIiIiNZLASMiIiIiIiIiWi8FjIiIiIiIiIhovRQwIiIiIiIiIqL1Gi1gSPrJWNZFRETE3EnSkpKWaDpHREREtF/TLTBe0HtH0vzA+g1liYiIiD6R9FJJlwOXAVdIulRSzgEiIiJiWI0UMCTtJel+4EWS7qu3+4HbgZObyBQRERF9dRjwIduTbD8L+DDw44YzRURERIs1UsCw/X+2lwC+ZnvJelvC9nK292oiU0RERPTV47b/OHDH9rnAYw3miYiIiJZboMknt72XpJWBZ/VmsX1Oc6kiIiKiD/4g6QfAMYCBtwJnS3oJgO2LmwwXERER7dNoAUPSAcCOwFXA43W1gVELGJK2BA4E5gd+ZPuAQdtXA44Elq777Gn7lAkLHxEREeOxbv13n0Hr16OcC2w63AMlXQ/cTzl3eMz2ZEnLAscBk4DrgR1s3y1JlPOFrYGHgHcPFEck7Qx8rh72S7aPrOvXB44AFgVOAT5q28M9x2z99BERETHLGi1gAG8E1rL9yKw8qA72eRDwWmAacKGkKbav6tntc8Dxtg+RtDblBGTSxMSOiIiI8bD96nEe4tW27+i5vyfwe9sHSNqz3v8MsBWwZr1tCBwCbFiLEfsAkykFk4vqucTddZ/3AedTzh+2BE4d4TkiIiKiD5ouYFwHLAjMUgED2AC41vZ1AJKOBbajtOQYYGDJurwU8O/xRY2IiIiJIunzQ623vd9sHnI7YJO6fCRwNqW4sB1wlG0D50laWtJKdd8zbN9V85wBbCnpbGBJ2+fV9UcBb6AUMIZ7joiIiOiDpgsYDwGXSPo9PUUM2x8Z5XErAzf13J9GuarSa1/gt5L2ABYDXjPutBERETFRHuxZXgR4PXD1GB9ryme8gR/YPhRY0fYtdfutwIp1eahzhpVHWT9tiPWM8BwRERHRB00XMKbU25ywE3CE7W9IehnwE0nr2H6idydJuwG7Aay22mpzKEpERET0sv2N3vuSvg6cPsaHv8L2zZKeDpwh6W+Dju1a3JhjhnuOnFdERETMOU3PQnKkpEWB1WxfMwsPvRlYtef+KnVdr10pfVax/RdJiwDLA7cP
ynAocCjA5MmT5+jJTkRERAzraZTP81HZvrn+e7ukX1C6lt4maSXbt9QuIgOf98OdM9zMjO4gA+vPrutXGWJ/RniO3mw5r4iIiJhD5mvyySVtA1wCnFbvv1jSWFpkXAisKWl1SQtRZjIZ/Lgbgc3qcZ9PaZ46fYKiR0RExDhIulzSZfV2JXAN8O0xPG4xSUsMLAObA1dQzgN2rrvtDJxcl6cA71KxEXBv7QZyOrC5pGUkLVOPc3rddp+kjeoMJu8adKyhniMiIiL6oOkuJPtSrpqcDWD7EknPHu1Bth+TtDvl5GN+4HDbV0raD5hqewrwCeCHkv6X0lf23XUAr4iIiGje63uWHwNus/3YGB63IvCLUltgAeBntk+TdCFwvKRdgRuAHer+p1CmUL2WMvbWLgC275L0RcpFEYD9Bgb0BD7EjGlUT603gAOGeY6IiIjog6YLGI/avreehAx4Yride9k+hXJS0rvu8z3LVwEbT0TIiIiImFi2b5C0LvA/ddU5wGVjeNx1wLpDrL+T2vJy0HoDHx7mWIcDhw+xfiqwzlifIyIiIvqj0S4kwJWS3gbML2lNSd8F/txwpoiIiJjDJH0UOBp4er0dXWcOi4iIiBhS0wWMPYAXUKZQPQa4D/hYk4EiIiKiL3YFNrT9+dqCciPgfQ1nioiIiBZrehaSh4C96y0iIiLmHQIe77n/eF0XERERMaRGCxiSJgOfBSb1ZrH9oqYyRURERF/8GDi/ToMK8AbgsObiRERERNs1PYjn0cCngMsZ4+CdERER0W2S5gPOo8xC9oq6ehfbf20sVERERLRe0wWM6XXK04iIiJhH2H5C0kG21wMubjpPREREdEPTBYx9JP0I+D1lIE8AbJ/UXKSIiIjog99LejNwUp3qNCIiImJETRcwdgGeByzIjC4kBlLAiIiImLu9H/g48JikhykDeNr2ks3GioiIiLZquoDxUttrNZwhIiIi+sz2Ek1niIiIiG5puoDxZ0lr276q4RwRERHRB5LmBxa1/UC9vxGwUN38V9v3NxYuIiIiWq3pAsZGwCWS/kUZA2Og+WimUY2IiJg7fQW4HfhqvX8McAWwCGVAz880lCsiIiJarukCxpYNP39ERET012bAS3vu32N7G0kC/thQpoiIiOiA+Zp8cts3ANOARymDdw7cIiIiYu40n+3Heu5/BkrzS2DxZiJFREREFzTaAkPSHsA+wG3MPAtJupBERETMnRaStMTAWBe2fwsgaSlKN5KIiIiIITXdheSjwFq272w4R0RERPTHD4HjJH3A9o0Akp4FHAL8qNFkERER0WqNdiEBbgLubThDRERE9IntbwJTgHMl3SnpLuAc4Fe2vz6WY0iaX9JfJf263l9d0vmSrpV0nKSF6vqF6/1r6/ZJPcfYq66/RtIWPeu3rOuulbRnz/ohnyMiIiL6p+kCxnXA2fUk4uMDt4YzRURExBxk+/u2VwMmAc+y/Szbh8zCIT4KXN1z/yvAt2yvAdwN7FrX7wrcXdd/q+6HpLWBHYEXUAYUP7gWReYHDgK2AtYGdqr7jvQcERER0SdNFzBuBM6gzP++RM8tIiIi5mKSVgS+DRxf768tadSigKRVgNdRu5vU2Us2BU6suxwJvKEub1fvU7dvVvffDjjW9iO2/wVcC2xQb9favs72f4Fjge1GeY6IiIjok0bHwLD9BQBJi9f7DzSZJyIiIvrmCODHwN71/t+B44DDRnnct4FPM+OCx3KUqVgHZjaZBqxcl1emdFfF9mOS7q37rwyc13PM3sfcNGj9hqM8x0wk7QbsBrDaaquN8qNERETErGi0BYakdST9FbgSuFLSRZJe0GSmiIiI6IvlbR9PnYWsFgceH+kBkl4P3G77oj7kmy22D7U92fbkFVZYoek4ERERc5WmZyE5FPi47bMAJG1CGZ385Q1mioiIiDnvQUnLUaZPR9JGjD6w98bAtpK2pky5uiRwILC0pAVqEWQV4Oa6/83AqsA0SQsA
SwF39qwf0PuYodbfOcJzRMQoNv7uxk1H6Iw/7fGnpiNEtFrTY2AsNlC8ALB9NrDYWB443Cjhg/bZQdJVkq6U9LOJiRwRERET4OOU2UieI+lPwFHAHiM9wPZetlexPYkyCOeZtt8OnAVsX3fbGTi5Lk+p96nbz7Ttun7HOkvJ6sCawAXAhcCadcaRhepzTKmPGe45IiIiok+aboFxnaT/B/yk3n8HZWaSEfWMEv5aSj/UCyVNsX1Vzz5rAnsBG9u+W9LTJzx9REREzBbbF0t6FbAWIOAa24/O5uE+Axwr6UvAX5kxjsZhwE8kXQvcRSlIYPtKSccDVwGPAR+2/TiApN2B04H5gcNtXznKc0RERESfNF3AeA/wBeAkShPSP9Z1o3lylHAAScdSRhS/qmef9wEH2b4bwPbtE5g7IiIixkHSh4GjBwoEkpaRtJPtg8fy+Npq8+y6fB3l3GDwPg8Dbxnm8fsD+w+x/hTglP/P3p/HWVbV9/7/680oTgzSIcggRFsNTgh9ATVXCSg0RsVZ0AgaAvdeIeLXIUKSnxgQo0mUK0a5oqJgVASi11ZRRBRnhkYQBERaHGhEQUYBAcHP74+9ynsoq6qrmqpzdlW/no/HeZy9P3vtvT6naLpWf87aa08Qn7APSZI0PCMrYLRZFJ+uqr9cjdP/sKp4M7ZK+KBHt36+Tfctylur6ksT5OFq4ZIkDd+BVfW+sZ02W/JAYFoFDEmStOYZ2RoYbarm75NsOEddrEN3T+uuwL7AB5NsNEEerhYuSdLwrZ0kYzvti431RpiPJEnquVHfQnIbcEmSM4Hbx4JV9dpVnDfV6uFjVgLntvtpf5LkR3QFjfPvd9aSJOn++hLwqSQfaPv/o8UkSZImNOoCxqfba6b+sEo4XeFiH+Dl49r8X7qZFx9JsindLSWrXCBUkiQNxZvpihb/q+2fCXxodOlIkqS+G2kBo6pOXM3z7plolfAkRwLLq2pZO7ZHksuAe4E3VdUNs5W7JElafVX1e+C49pIkSVqlkRQwkpxSVS9Ncgnd00fuo6qeuKprTLRKeFW9ZWC76J4x//r7n7EkSZoNszEGkCRJa6ZRzcA4tL0/Z0T9S5Kk0XAMIEmSVstIChhVdW3bfBFwclX9YhR5SJKk4XIMIEmSVtfIHqPaPAQ4M8k3kxySZLMR5yNJkobDMYAkSZqRkRYwquqfq+pxwMHA5sDXk3xllDlJkqS55xhAkiTN1KhnYIy5DvglcAPwJyPORZIkDY9jAEmSNC0jLWAkeU2Ss4GzgIcBB7r6uCRJC59jAEmSNFOjegrJmK2A11XVRSPOQ5IkDZdjAEmSNCOjXgPjcOCSJA9PsvXYa5Q5SZKkudfGAA9O8mqAJIuSbDvitCRJUo+NdAZGkkOAtwK/An7fwgU4hVSSpAUsyRHAEuAxwEeAdYH/BJ42yrwkSVJ/jXoRz9cBj6mqx1XVE9rL4oUkSQvfC4DnAbcDVNUv6B6tOqUkD0hyXpLvJ7k0yT+3+LZJzk2yIsmnkqzX4uu3/RXt+DYD1zq8xa9IsudAfGmLrUhy2EB8wj4kSdJwjLqAcTVwy4hzkCRJw3d3VRXdzEuSPGia590F7FZVTwK2B5Ym2QV4J3BMVT0KuAk4oLU/ALipxY9p7UiyHbAP8DhgKfD+JGsnWRt4H7AXsB2wb2vLFH1IkqQhGHUB4yrg7PYNyOvHXiPOSZIkzb1TknwA2CjJgcBXgA+u6qTq3NZ2122vAnYDTmvxE4Hnt+292z7t+O5J0uInV9VdVfUTYAWwU3utqKqrqupu4GRg73bOZH1IkqQhGPVTSH7eXuu1lyRJWgNU1b8neRZwK906GG+pqjOnc26bJXEB8Ci62RI/Bm6uqntak5XAFm17C7oZn1TVPUluoXts6xbAOQOXHTzn6nHxnds5k/UxmNtBwEEAW2/tuuSSJM2mkRYwqmrsvtUHVtUdo8xFkiQNVytYTKtoMe68e4Htk2wEfAZ47Cynttqq
6njgeIAlS5bUiNORJGlBGektJEmekuQy4Idt/0lJ3j/KnCRJ0txJ8pskt072msm1qupm4GvAU+huRRn7YmZL4Jq2fQ2wVet7HWBD4IbB+LhzJovfMEUfkiRpCEa9Bsb/BvakGxRQVd8Hnj7KhCRJ0typqodU1UOB9wCH0d2GsSXwZrpxwZSSLGozL0iyAfAs4HK6QsaLW7P9gc+27WVtn3b8q23x0GXAPu0pJdsCi4HzgPOBxe2JI+vRLfS5rJ0zWR+SJGkIRr0GBlV1dbcu1h/cO6pcJEnS0DyvPUlkzHFJvg+8ZRXnbQ6c2NbBWAs4pao+32Z0npzkbcCFwIdb+w8DH0uyAriRriBBVV2a5BTgMuAe4OB2awpJDgHOANYGTqiqS9u13jxJH5IkaQhGXcC4OslTgUqyLnAo3bcokiRpYbs9ySvonvJRwL7A7as6qaouBp48QfwquieIjI/fCbxkkmsdDRw9Qfx04PTp9iFJkoZj1LeQ/E/gYLrpo9fQPc/94FEmJEmShuLlwEuBX7XXS1pMkiRpQqN+CsmvgVdMdjzJ4VX1L0NMSZIkDUFV/RTYe7LjjgEkSdJ4o56BsSoTTvmUJEkLnmMASZJ0H30vYGTSA8nSJFckWZHksCnavShJJVkyNylKkqQ5MOkYQJIkrZn6XsCoiYJt5fH3AXsB2wH7JtlugnYPoVsY9Ny5TFKSJM26CccAkiRpzdX3AsZk377sBKyoqquq6m66Fcwnuo/2KOCdwJ1zlJ8kSZobzsCQJEn3MdICRpJNJohtO7B76iSnbgFcPbC/ssUGr7MDsFVVfWEVORyUZHmS5ddff/30EpckSffL/RgDSJKkNdSoZ2B8LslDx3babSCfG9uvqrevzkWTrAW8G3jDqtpW1fFVtaSqlixatGh1upMkSTM3J2MASZK0cI26gPF2ugHMg5PsSPdty19P47xrgK0G9rdssTEPAR4PnJ3kp8AuwDIX8pQkqTdWdwwgSZLWUOuMsvOq+kKSdYEv0xUdXlBVP5rGqecDi9tU02uAfYCXD1z3FmDTsf0kZwNvrKrls5i+JElaTfdjDCBJktZQIylgJHkv911dfEPgx8AhSaiq1051flXdk+QQ4AxgbeCEqro0yZHA8qpaNle5S5Kk1Xd/xwCSJGnNNaoZGONnQlww0wtU1enA6eNib5mk7a4zvb4kSZoT93sMIEmS1kwjKWBU1YkASR4E3FlV97b9tYH1R5GTJEmae44BJEnS6hr1Ip5nARsM7G8AfGVEuUiSpOFxDCBJkmZk1AWMB1TVbWM7bfuBI8xHkiQNh2MASZI0I6MuYNyeZIexnfYYtd+OMB9JkjQcqzUGSLJVkq8luSzJpUkObfFNkpyZ5Mr2vnGLJ8mxSVYkuXhcn/u39lcm2X8wlySXtHOOTZKp+pAkScMx6gLG64BTk3wzybeATwGHjDYlSZI0BK9j9cYA9wBvqKrtgF2Ag5NsBxwGnFVVi+luTzmstd8LWNxeBwHHQVeMAI4AdgZ2Ao4YKEgcBxw4cN7SFp+sD0mSNASjegoJAFV1fpLHAo9poSuq6nejzEmSJM291R0DVNW1wLVt+zdJLge2APYGdm3NTgTOBt7c4idVVQHnJNkoyeat7ZlVdSNAkjOBpUnOBh5aVee0+EnA84EvTtGHJEkagpEUMJLsVlVfTfLCcYce3Z4B/+lR5CVJkubWbI4BkmwDPBk4F9isFTcAfgls1ra3AK4eOG1li00VXzlBnCn6GMzpILqZHmy99dbT/SiSJGkaRjUD4xnAV4HnTnCsAAsYkiQtTLMyBkjyYOC/gNdV1a1tmYruIlWVpGYh10lN1kdVHQ8cD7BkyZI5zUGSpDXNSAoYVXVEe3/1KPqXJEmjMRtjgCTr0hUvPj4wY+NXSTavqmvbLSLXtfg1wFYDp2/ZYtfw/24HGYuf3eJbTtB+qj4kSdIQjOoWktdPdbyq3j2sXCRJ0vDc3zFAeyLIh4HLx7VdBuwP
vKO9f3YgfkiSk+kW7LylFSDOAN4+sHDnHsDhVXVjkluT7EJ3a8p+wHtX0YckSRqCUd1C8pApjjndUpKkhev+jgGeBrwSuCTJRS32D3RFhVOSHAD8DHhpO3Y68GxgBXAH8GqAVqg4Cji/tTtybEFP4DXAR4EN6Bbv/GKLT9aHJEkaglHdQvLPAElOBA6tqpvb/sbAu0aRkyRJmnv3dwxQVd8CMsnh3SdoX8DBk1zrBOCECeLLgcdPEL9hoj4kSdJwrDXi/p84NnABqKqb6FYTlyRJC5tjAEmSNCOjLmCsNXDvKUk2YXS3tUiSpOFxDCBJkmZk1AOFdwHfTXJq238JcPQI85EkScPhGECSJM3ISAsYVXVSkuXAbi30wqq6bJQ5SZKkuecYQJIkzdSoZ2DQBisOWCRJWsM4BpAkSTMx6jUwJEmSJEmSVskChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6btwWMJEuTXJFkRZLDJjj++iSXJbk4yVlJHjGKPCVJkiRJ0v03LwsYSdYG3gfsBWwH7Jtku3HNLgSWVNUTgdOAfx1ulpIkSZIkabbMywIGsBOwoqquqqq7gZOBvQcbVNXXquqOtnsOsOWQc5QkSZIkSbNkvhYwtgCuHthf2WKTOQD44kQHkhyUZHmS5ddff/0spihJkiRJkmbLfC1gTFuSvwaWAP820fGqOr6qllTVkkWLFg03OUmSJEmSNC3rjDqB1XQNsNXA/pYtdh9Jngn8I/CMqrprSLlJkiRJkqRZNl9nYJwPLE6ybZL1gH2AZYMNkjwZ+ADwvKq6bgQ5SpKkWZbkhCTXJfnBQGyTJGcmubK9b9ziSXJse2LZxUl2GDhn/9b+yiT7D8R3THJJO+fYJJmqD0mSNDzzsoBRVfcAhwBnAJcDp1TVpUmOTPK81uzfgAcDpya5KMmySS4nSZLmj48CS8fFDgPOqqrFwFltH7qnlS1ur4OA46ArRgBHADvTLQx+xEBB4jjgwIHzlq6iD0mSNCTz9RYSqup04PRxsbcMbD9z6ElJkqQ5VVXfSLLNuPDewK5t+0TgbODNLX5SVRVwTpKNkmze2p5ZVTcCJDkTWJrkbOChVXVOi58EPJ9uIfDJ+pAkSUMyL2dgSJIkDdisqq5t278ENmvbkz21bKr4ygniU/VxHz7dTJKkuWMBQ5IkLRhttkWNqg+fbiZJ0tyxgCFJkua7X7VbQ2jvY4t3T/bUsqniW04Qn6oPSZI0JBYwJEnSfLcMGHuSyP7AZwfi+7WnkewC3NJuAzkD2CPJxm3xzj2AM9qxW5Ps0p4+st+4a03UhyRJGpJ5u4inJEla8yT5JN1impsmWUn3NJF3AKckOQD4GfDS1vx04NnACuAO4NUAVXVjkqPoHssOcOTYgp7Aa+iedLIB3eKdX2zxyfqQJElDYgFDkiTNG1W17ySHdp+gbQEHT3KdE4ATJogvBx4/QfyGifqQJEnD4y0kkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeq9dUadgCRJkiRp9nz96c8YdQrzwjO+8fVRp6AZcgaGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6r15+xSSJEuB9wBrAx+qqneMO74+cBKwI3AD8LKq+umw85Rm28+PfMKoU9AQbf2WS0adgqQBqxp/SJKkuTMvZ2AkWRt4H7AXsB2wb5LtxjU7ALipqh4FHAO8c7hZSpKkhWSa4w9JkjRH5mUBA9gJWFFVV1XV3cDJwN7j2uwNnNi2TwN2T5Ih5ihJkhaW6Yw/JEnSHJmvt5BsAVw9sL8S2HmyNlV1T5JbgIcBv56LhHZ800lzcVn11AX/tt+oU5AkDd90xh+SJGmOzNcCxqxJchBwUNu9LckVo8xnHtqUOSoK9Vn+ff9Rp7AmWiP/rHGEE8dGYI38
s5bX3q8/a4+YrTzmuwU0rujd/wdrwO/e3v3M1wC9+5nfz7+L+653P28W/gT9/v3Mp2/CscV8LWBcA2w1sL9li03UZmWSdYAN6RbzvI+qOh44fo7yXPCSLK+qJaPOQwuff9Y0LP5Z0xRWOf5YKOMK/z8YPn/mw+fPfLj8eQ/fQvyZz9c1MM4HFifZNsl6wD7AsnFtlgFjpfoXA1+tqhpijpIkaWGZzvhDkiTNkXk5A6OtaXEIcAbdY8xOqKpLkxwJLK+qZcCHgY8lWQHcSDfIkCRJWi2TjT9GnJYkSWuMeVnAAKiq04HTx8XeMrB9J/CSYee1Bpr302Q1b/hnTcPinzVNaqLxxwLl/wfD5898+PyZD5c/7+FbcD/zeFeFJEmSJEnqu/m6BoYkSZIkSVqDWMCQJEmSJEm9ZwFDkiRJkiT1ngUMSb2U5LFJdk/y4HHxpaPKSQtfkpNGnYOkhc/fccOXZKck/61tb5fk9UmePeq81iT+jh2uJH/R/pzvMepcZpOLeGpWJHl1VX1k1HloYUjyWuBg4HJge+DQqvpsO/a9qtphhOlpgUiybHwI+EvgqwBV9byhJyX1jL/fZ5+/44YvyRHAXnRPYDwT2Bn4GvAs4IyqOnqE6S1I/o4dviTnVdVObftAur9nPgPsAXyuqt4xyvxmiwUMzYokP6+qrUedhxaGJJcAT6mq25JsA5wGfKyq3pPkwqp68mgz1EKQ5HvAZcCHgKIbXH0S2Aegqr4+uuykfvD3++zzd9zwtZ/59sD6wC+BLavq1iQbAOdW1RNHmd9C5O/Y4Rv8+yPJ+cCzq+r6JA8CzqmqJ4w2w9mxzqgT0PyR5OLJDgGbDTMXLXhrVdVtAFX10yS7AqcleQTdnzdpNiwBDgX+EXhTVV2U5LcOqrSm8ff70Pk7bvjuqap7gTuS/LiqbgWoqt8m+f2Ic1uo/B07fGsl2ZhumYhU1fUAVXV7kntGm9rssYChmdgM2BO4aVw8wHeGn44WsF8l2b6qLgJo31I9BzgBWBDVY41eVf0eOCbJqe39V/h7UWsmf78Pl7/jhu/uJA+sqjuAHceCSTYELGDMAX/HjsSGwAV0f3dXks2r6tq21s6CKY76h0gz8XngwWO/cAclOXvo2Wgh2w+4T6W4qu4B9kvygdGkpIWqqlYCL0nyV8Cto85HGgF/vw+Xv+OG7+lVdRf84R/WY9YF9h9NSmsGf8cOT1VtM8mh3wMvGGIqc8o1MCRJkiRJUu/5GFVJkiRJktR7FjAkSZIkSVLvWcCQNHRJZrQoXJJdk3x+rvKRJEnzm2MLac1gAUPS0FXVU0edgyRJWjgcW0hrBgsYkoYuyW3tfdckZyc5LckPk3w8SdqxpS32PeCFA+c+KMkJSc5LcmGSvVv8PUne0rb3TPKNJP4dJ0nSGsCxhbRm8DGqkkbtycDjgF8A3waelmQ58EFgN2AF8KmB9v8IfLWq/ibJRsB5Sb4CHA6cn+SbwLHAs8c9Kk2SJK0ZHFtIC5QVREmjdl5VrWwDgouAbYDHAj+pqiure9bzfw603wM4LMlFwNnAA4Ctq+oO4EDgTOA/qurHQ/sEkiSpTxxbSAuUMzAkjdpdA9v3suq/lwK8qKqumODYE4AbgIfPUm6SJGn+cWwhLVDOwJDURz8EtknyyLa/78CxM4C/G7if9cnt/RHAG+imje6VZOch5itJkvrNsYW0AFjAkNQ7VXUncBDwhbbQ1nUDh48C1gUuTnIpcFQbcHwYeGNV/QI4APhQkgcMOXVJktRDji2khSHdLWCSJEmSJEn95QwMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS
1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJElDleRVSb41sH9bkj9bxTnbJKkk68x9hpL6yAKGpDnnIEWSJE2lqh5cVVeNOo/Z4BhGmjv+TyVp6KrqwaPOQZIkaVSSrFNV94w6D2m+cQaGJEmSpDmTZKskn05yfZIbkvzHBG0qyaPa9gZJ3pXkZ0luSfKtJBtMcM6Lkvw0yeNX0f9fJPlOkpuTXJ3kVS2+YZKTWl4/S/JPSdZqx96a5D8HrnGfWRVJzk5yVJJvJ/lNki8n2bQ1/0Z7v7nNOn1Km4367STHJLkBODLJjUmeMNDHnyS5I8mimfx8pTWJBQxJs6oHg5TnJbm0DVLOTvLnA8fenOSaNtC4Isnus/GZJUnSxJKsDXwe+BmwDbAFcPIqTvt3YEfgqcAmwN8Dvx933VcD7wSeWVU/mKL/RwBfBN4LLAK2By5qh98LbAj8GfAMYD/g1dP8aAAvb+3/BFgPeGOLP729b9Rujflu298ZuArYDDiK7ufw1wPX2xc4q6qun0EO0hrFAoakWdODQcqjgU8Cr6MbpJwOfC7JekkeAxwC/LeqegiwJ/DTGX1ASVoNSU5Icl2SSf/+Gtf+pUkua8XYT8x1ftIc2wl4OPCmqrq9qu6sqm9N1rjNgPgb4NCquqaq7q2q71TVXQPNXge8Cdi1qlasov+XA1+pqk9W1e+q6oaquqiNWfYBDq+q31TVT4F3Aa+cwWf7SFX9qKp+C5xCVxyZyi+q6r1VdU8750Rg3yRpx18JfGwG/UtrHAsYkmbTqAcpLwO+UFVnVtXv6IojG9AVR+4F1ge2S7JuVf20qn68uh9Ukmbgo8DS6TRMshg4HHhaVT2O7u9AaT7bCvjZDNZ72BR4ADDV7+g3Ae+rqpXT7H+ia20KrEv3pcuYn9F9+TJdvxzYvgNY1RpfVw/uVNW57bxdkzwWeBSwbAb9S2scCxiSZtOoBykPZ2AgUlW/pxssbNGKH68D3gpcl+TkJA+fZp6StNqq6hvAjYOxJI9M8qUkFyT5ZvvHC8CBdH/n3dTOvW7I6Uqz7Wpg6xk8kePXwJ3AI6doswfwT0leNM3+J7rWr4HfAY8YiG0NXNO2bwceOHDsT6fR15iaQfxEuttIXgmcVlV3zqAfaY1jAUPS
bBr1IOUXDAxE2pTMrWiDkar6RFX9RWtTdLelSNIoHA/8XVXtSHff/Ptb/NHAo9tif+ckmdbMDanHzgOuBd6R5EFJHpDkaZM1bl8+nAC8O8nDk6zdFsFcf6DZpXSzmt6X5Hmr6P/jwDPbrVnrJHlYku2r6l662z6OTvKQtlbG64GxhTsvAp6eZOskG9LNjJqu6+luh53ykfHNfwIvoCtinDSDPqQ1kgUMSbNp1IOUU4C/SrJ7knWBNwB3Ad9J8pgku7Vr3wn8lnFrbUjSMCR5MN2tbacmuQj4ALB5O7wOsBjYlW5Bvw8m2Wj4WUqzoxUKnkt3e8TPgZV0t3xO5Y3AJcD5dLOX3sm4f7dU1feB59D9P7LXFP3/HHg23ZjgRrrCxJPa4b+jm2lxFfAt4BN04xKq6kzgU8DFwAV0a3xNS1XdARwNfLstKr7LFG2vBr5H98XKN6fbh7SmStVkM5wkaeaSbA0cC/x3ul/Gn6D7xfy3bfYDSQpYXFUr2hNH/gV4Cd29o9+nW2BzM+AnwLpVdU+SJcAXgFdV1Ren6P8FdIOGLegGKa+pqkuTPBH4EPDndFNGvwMcVFW/mOUfgST9kSTbAJ+vqscneShwRVVtPkG7/wOcW1UfaftnAYdV1flDTVjS0CQ5gW6Bz38adS5S31nAkCRJmmODBYy2/x3gmKo6td3u9sSq+n67ZWTfqto/yabAhcD2VXXDyJKXNGfa3w0XAU+uqp+MNhup/7yFRJIkaQ4l+STwXeAxSVYmOQB4BXBAku/T3Sq3d2t+BnBDksuAr9E91cnihTSFJK9IctsEr0tHndtUkhwF/AD4N4sX0vQ4A0PSvJLkFXT3i4/3s/bIQUmSJEkLkAUMSZIkSZLUe9N91OEaYdNNN61tttlm1GlIkjSvXHDBBb+uqkWjzqNvHFdIkrR6JhtbWMAYsM0227B8+fJRpyFJ0ryS5GejzqGPHFdIkrR6JhtbuIinJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfd8jKokzQNH//WLR53CavnH/zxt1ClIvbLjm04adQp/5IJ/22/UKUiSNC3OwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZK0xktyQpLrkvxgkuNJcmySFUkuTrLDsHOUJGlNZwFDkiQJPgosneL4XsDi9joIOG4IOUmSpAFzXsBIslGS05L8MMnlSZ6SZJMkZya5sr1v3NpO+u1Gkv1b+yuT7D8Q3zHJJe2cY5OkxSfsQ5Ikabyq+gZw4xRN9gZOqs45wEZJNh9OdpIkCYYzA+M9wJeq6rHAk4DLgcOAs6pqMXBW24dJvt1IsglwBLAzsBNwxEBB4jjgwIHzxr49mawPSZKkmdoCuHpgf2WLSZKkIZnTAkaSDYGnAx8GqKq7q+pmum8xTmzNTgSe37Yn+3ZjT+DMqrqxqm4CzgSWtmMPrapzqqqAk8Zda6I+JEmS5kSSg5IsT7L8+uuvH3U6kiQtKHM9A2Nb4HrgI0kuTPKhJA8CNquqa1ubXwKbte3Jvt2YKr5ygjhT9HEfDjQkSdI0XANsNbC/ZYvdR1UdX1VLqmrJokWLhpacJElrgrkuYKwD7AAcV1VPBm5n3K0cbeZEzWUSU/XhQEOSJE3DMmC/tl7XLsAtA1+USJKkIZjrAsZKYGVVndv2T6MraPxqbOGr9n5dOz7ZtxtTxbecIM4UfUiSJN1Hkk8C3wUek2RlkgOS/M8k/7M1OR24ClgBfBB4zYhSlSRpjbXOXF68qn6Z5Ookj6mqK4Ddgcvaa3/gHe39s+2UZcAhSU6mW7Dzlqq6NskZwNsHFu7cAzi8qm5Mcmv7JuRcYD/gvQPXmqgPSZKk+6iqfVdxvICDh5SOJEmawJwWMJq/Az6eZD26by5eTTfz45QkBwA/A17a2p4OPJvu2407Wlta
oeIo4PzW7siqGnvU2Wvont2+AfDF9oKucDFRH5IkSZIkaZ6Z8wJGVV0ELJng0O4TtJ30242qOgE4YYL4cuDxE8RvmKgPSZIkSZI0/8z1GhiSJEmSJEn3mwUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvTfnBYwkP01ySZKLkixvsU2SnJnkyva+cYsnybFJViS5OMkOA9fZv7W/Msn+A/Ed2/VXtHMzVR+SJEmSJGn+GdYMjL+squ2raknbPww4q6oWA2e1fYC9gMXtdRBwHHTFCOAIYGdgJ+CIgYLEccCBA+ctXUUfkiRJkiRpnhnVLSR7Aye27ROB5w/ET6rOOcBGSTYH9gTOrKobq+om4ExgaTv20Ko6p6oKOGnctSbqQ5IkSZIkzTPDKGAU8OUkFyQ5qMU2q6pr2/Yvgc3a9hbA1QPnrmyxqeIrJ4hP1cd9JDkoyfIky6+//voZfzhJkiRJkjT3hlHA+Iuq2oHu9pCDkzx98GCbOVFzmcBUfVTV8VW1pKqWLFq0aC7TkCRJPZVkaZIr2ppaf3TbaZKtk3wtyYVtna5njyJPSZLWZHNewKiqa9r7dcBn6Naw+FW7/YP2fl1rfg2w1cDpW7bYVPEtJ4gzRR+SJEl/kGRt4H10X7ZsB+ybZLtxzf4JOKWqngzsA7x/uFlKkqQ5LWAkeVCSh4xtA3sAPwCWAWNPEtkf+GzbXgbs155GsgtwS7sN5AxgjyQbt8U79wDOaMduTbJLe/rIfuOuNVEfkiRJg3YCVlTVVVV1N3Ay3Vpagwp4aNveEPjFEPOTJEnAOnN8/c2Az7Qnm64DfKKqvpTkfOCUJAcAPwNe2tqfDjwbWAHcAbwaoKpuTHIUcH5rd2RV3di2XwN8FNgA+GJ7Abxjkj4kSZIGTbTW1s7j2ryVbk2vvwMeBDxzogu19b4OAth6661nPVFJktZkc1rAqKqrgCdNEL8B2H2CeAEHT3KtE4ATJogvBx4/3T4kSZJWw77AR6vqXUmeAnwsyeOr6veDjarqeOB4gCVLlszpGl+SJK1pRvUYVUmSpL6YbK2tQQcApwBU1XeBBwCbDiU7SZIEWMCQJEk6H1icZNsk69Et0rlsXJuf02Z2JvlzugKGz1+XJGmILGBIkqQ1WlXdAxxCt2j45XRPG7k0yZFJnteavQE4MMn3gU8Cr2q3vkqSpCGZ60U8JUmShibJE6rqkpmeV1Wn0y0mPhh7y8D2ZcDT7n+GkiRpdTkDQ5IkLSTvT3Jektck2XDUyUiSpNljAUOSJC0YVfXfgVfQLcp5QZJPJHnWiNOSJEmzwAKGJElaUKrqSuCfgDcDzwCOTfLDJC8cbWaSJOn+sIAhSZIWjCRPTHIM3WKcuwHPrao/b9vHjDQ5SZJ0v7iIpyRJWkjeC3wI+Ieq+u1YsKp+keSfRpeWJEm6vyxgSJKkBSHJ2sA1VfWxiY5PFpckSfODt5BIkqQFoaruBbZKst6oc5EkSbPPGRiSJGkh+Qnw7STLgNvHglX17tGlJEmSZoMFDEmStJD8uL3WAh7SYjW6dCRJ0myxgCFJkhaSy6rq1MFAkpeMKhlJkjR7XANDkiQtJIdPMyZJkuYZZ2BIkqR5L8lewLOBLZIcO3DoocA9o8lKkiTNpqHMwEiydpILk3y+7W+b5NwkK5J8amy18CTrt/0V7fg2A9c4vMWvSLLnQHxpi61IcthAfMI+JEnSgvQLYDlwJ3DBwGsZsOcU50mSpHlitQoYSTZO8sQZnHIocPnA/juBY6rqUcBNwAEtfgBwU4sf09qRZDtgH+BxwFLg/a0osjbwPmAvYDtg39Z2qj4kSdICU1Xfr6oTgUdV1YkDr09X1U2j
zk+SJN1/0y5gJDk7yUOTbAJ8D/hgklU+kizJlsBfAR9q+wF2A05rTU4Ent+29277tOO7t/Z7AydX1V1V9RNgBbBTe62oqquq6m7gZGDvVfQhSZIWrp2SnJnkR0muSvKTJFeNOilJknT/zWQNjA2r6tYkfwucVFVHJLl4Guf9b+Dv+X+PMnsYcHNVjd2PuhLYom1vAVwNUFX3JLmltd8COGfgmoPnXD0uvvMq+riPJAcBBwFsvfXW0/g4kiSpxz4M/H90t4/cO+JcJEnSLJrJLSTrJNkceCnw+emckOQ5wHVVdcHqJDcMVXV8VS2pqiWLFi0adTqSJOn+uaWqvlhV11XVDWOvUSclSZLuv5nMwDgSOAP4dlWdn+TPgCtXcc7TgOcleTbwALqVwN8DbJRknTZDYkvgmtb+GmArYGWSdYANgRsG4mMGz5kofsMUfUiSpIXra0n+Dfg0cNdYsKq+N7qUJEnSbJh2AaOqTgVOHdi/CnjRKs45nPbs9SS7Am+sqlckORV4Md2aFfsDn22nLGv7323Hv1pVlWQZ8Im25sbDgcXAeUCAxUm2pStQ7AO8vJ3ztUn6kCRJC9fO7X3JQKzo1saSJEnz2LQLGG0xzvfSzaoA+CZwaFWtXI1+3wycnORtwIV096vS3j+WZAVwI11Bgqq6NMkpwGV0z3I/uKrubXkdQjczZG3ghKq6dBV9SJKkBaqq/nLUOUiSpLkxk1tIPgJ8AnhJ2//rFnvWdE6uqrOBs9v2VXRPEBnf5s6B648/djRw9ATx04HTJ4hP2IckSVq4krxlonhVHTnsXCRJ0uyaySKei6rqI1V1T3t9FHDVS0mS1Ce3D7zuBfYCtlnVSUmWJrkiyYokh03S5qVJLktyaZJPzGbSkiRp1WYyA+OGJH8NfLLt70u3WKYkSVIvVNW7BveT/DvdraaTSrI28D66WaUrgfOTLKuqywbaLKZb1+tpVXVTkj+Z9eQlSdKUZjID42/oHqH6S+BaugUyXz0XSUmSJM2SB9I9jWwqOwErquqqqrqbbgHwvce1ORB4X1XdBFBV1816ppIkaUozeQrJz4DnTXY8yeFV9S+zkpUkSdJqSHIJ3VNHoFvgexHdo+CnsgVw9cD+Sv7f00zGPLpd/9vtum+tqi9N0P9BwEEAW2+99UzTlyRJU5jJLSSr8hLAAoYkSRql5wxs3wP8qqrumYXrrkP3GPdd6WZ0fCPJE6rq5sFGVXU8cDzAkiVLCkmSNGtmcgvJqmQWryVJkjRjbcboRsBzgRcA203jtGuArQb2t2yxQSuBZVX1u6r6CfAjuoKGJEkaktksYPgtgyRJGqkkhwIfB/6kvT6e5O9Wcdr5wOIk2yZZD9gHWDauzf+lm31Bkk3pbim5avYylyRJqzKbt5A4A0OSJI3aAcDOVXU7QJJ3At8F3jvZCVV1T5JD6J5WsjZwQlVdmuRIYHlVLWvH9khyGd3jWd9UVT6NTZKkIZp2ASPJJlV147jYtm0aJcCps5qZJEnSzIWuwDDmXqbxJUtVnQ6cPi72loHtAl7fXpIkaQRmMgPjc0n2qqpbAZJsB5wCPB6gqt4+B/lJkiTNxEeAc5N8pu0/H/jw6NKRJEmzZSYFjLfTFTH+CngMcBLwijnJSpIkaTVU1buTnA38RQu9uqouHGFKkiRplky7gFFVX0iyLvBl4CHAC6rqR3OWmSRJ0gwl2QW4tKq+1/YfmmTnqjp3xKlJkqT7aZUFjCTv5b5PGNkQ+DFwSBKq6rVzlZwkSdIMHQfsMLB/2wQxSZI0D01nBsbycfsXzEUikiRJsyBtwU0Aqur3SWbzqWuSJGlEVvkLvapOBEjyIODOqrq37a8NrD+36UmSJM3IVUleSzfrAuA1wFUjzEeSJM2StWbQ9ixgg4H9DYCvzG46kiRJ98v/BJ4KXAOsBHYGDhppRpIkaVbMZErlA6rqtrGdqrotyQOnOiHJA4Bv0M3UWAc4raqOSLItcDLwMLpbUl5ZVXcn
WZ/u6SY7AjcAL6uqn7ZrHQ4cQPc899dW1RktvhR4D7A28KGqekeLT9jHDD7vH+z4ppNW57SRu+Df9ht1CpIkDVVVXQfsM9nxJIdX1b8MMSVJkjRLZjID4/Ykf1gAK8mOwG9Xcc5dwG5V9SRge2BpWx38ncAxVfUo4Ca6wgTt/aYWP6a1I8l2dIORxwFLgfcnWbvdxvI+YC9gO2Df1pYp+pAkSWuul4w6AUmStHpmUsB4HXBqkm8m+RbwKeCQqU6oztisjXXbq4DdgNNa/ETg+W1777ZPO757krT4yVV1V1X9BFgB7NReK6rqqja74mRg73bOZH1IkqQ1V0adgCRJWj3TvoWkqs5P8ljgMS10RVX9blXntVkSFwCPopst8WPg5qq6pzVZCWzRtrcArm793ZPkFrpbQLYAzhm47OA5V4+L79zOmayP8fkdRLs3duutt17Vx5EkSfNbrbqJJEnqo1UWMJLsVlVfTfLCcYcenYSq+vRU57enlmyfZCPgM8BjVzvbOVBVxwPHAyxZssRBjSRJC5szMCRJmqemMwPjGcBXgedOcKyAKQsYf2hYdXOSrwFPATZKsk6bIbEl3UrhtPetgJXtme0b0i3mORYfM3jORPEbpuhDkiQtUEk2qaobx8W2bbegApw6grQkSdIsWOUaGFV1RHt/9QSvv5nq3CSL2swLkmwAPAu4HPga8OLWbH/gs217WdunHf9qVVWL75Nk/fZ0kcXAecD5wOIk2yZZj26hz2XtnMn6kCRJC9fnkjx0bKct7v25sf2qevtIspIkSffbdG4hef1Ux6vq3VMc3hw4sa2DsRZwSlV9PsllwMlJ3gZcCHy4tf8w8LEkK4AbaY9Bq6pLk5wCXAbcAxzcbk0hySHAGXSPUT2hqi5t13rzJH1IkqSF6+10RYy/olu36yTgFaNNSZIkzYbp3ELykCmOTblmRFVdDDx5gvhVdE8QGR+/k0keb1ZVRwNHTxA/HTh9un1IkqSFq6q+kGRd4Mt0Y5gXVNWPRpyWJEmaBassYFTVPwMkORE4tKpubvsbA++a0+wkSZKmIcl7ue8XKxvSPfnskLbo+GtHk5kkSZot036MKvDEseIFQFXdlOSPZldIkiSNwPJx+xeMJAtJkjRnZlLAWCvJxlV1E3SrfM/wfEmSpDlRVScCJHkQcOfAWllrA+uPMjdJkjQ7ZlKAeBfw3SRjjx97CROsSSFJkjRCZwHPBG5r+xvQrYfx1JFlJEmSZsW0CxhVdVKS5cBuLfTCqrpsbtKSJElaLQ+oqrHiBVV1W5IHjjIhSZI0O9aaSeOquqyq/qO9LF5IkqS+uT3JDmM7SXYEfruqk5IsTXJFkhVJDpui3YuSVJIls5SvJEmaJtewkCRJC8nrgFOT/AII8KfAy6Y6oa2T8T7gWcBK4Pwky8Z/WZPkIcChwLlzkLckSVoFCxiSJGnBqKrzkzwWeEwLXVFVv1vFaTsBK6rqKoAkJwN7A+Nnmx4FvBN40yymLEmSpmlGt5BIkiT1UZLd2vsLgecCj26v57bYVLYArh7YX9lig9ffAdiqqr6wijwOSrI8yfLrr79+hp9CkiRNxRkYkiRpIXgG8FW64sV4BXx6dS+cZC3g3cCrVtW2qo4HjgdYsmRJrW6fkiTpj1nAkCRJ815VHdHeX70ap18DbDWwv2WLjXkI8Hjg7CTQrauxLMnzqmr56mUsSZJmygKGJEma95K8fqrjVfXuKQ6fDyxOsi1d4WIf4OUD594CbDrQ19nAGy1eSJI0XBYwJEnSQvCQKY5NeStHVd2T5BDgDGBt4ISqujTJkcDyqlo2i3lKkqTVZAFDkiTNe1X1zwBJTgQOraqb2/7GwLumcf7pwOnjYm+ZpO2u9zNdSZK0GnwKiSRJWkieOFa8AKiqm4Anjy4dSZI0W+a0gJFkqyRfS3JZkkuTHNrimyQ5M8mV7X3jFk+SY5OsSHJxe2TZ2LX2b+2vTLL/QHzHJJe0c45NW11rsj4kSdKC
ttbg7/wkm+CMU0mSFoS5noFxD/CGqtoO2AU4OMl2wGHAWVW1GDir7QPsBSxur4OA4+APg48jgJ2BnYAjBgYnxwEHDpy3tMUn60OSJC1c7wK+m+SoJEcB3wH+dcQ5SZKkWTCnBYyquraqvte2fwNcDmwB7A2c2JqdCDy/be8NnFSdc4CNkmwO7AmcWVU3tqmgZwJL27GHVtU5VVXASeOuNVEfkiRpgaqqk4AXAr9qrxdW1cdGm5UkSZoNQ5tSmWQbuntQzwU2q6pr26FfApu17S2AqwdOW9liU8VXThBnij4kSdICVlWXAZeNOg9JkjS7hrKIZ5IHA/8FvK6qbh081mZOTPl4s/trqj6SHJRkeZLl119//VymIUmSJEmSVtOcFzCSrEtXvPh4VX26hX/Vbv+gvV/X4tcAWw2cvmWLTRXfcoL4VH3cR1UdX1VLqmrJokWLVu9DSpIkSZKkOTXXTyEJ8GHg8qp698ChZcDYk0T2Bz47EN+vPY1kF+CWdhvIGcAeSTZui3fuAZzRjt2aZJfW137jrjVRH5IkSZIkaZ6Z6zUwnga8ErgkyUUt9g/AO4BTkhwA/Ax4aTt2OvBsYAVwB/BqgKq6sa0kfn5rd2RV3di2XwN8FNgA+GJ7MUUfkiRJkiRpnpnTAkZVfQvIJId3n6B9AQdPcq0TgBMmiC8HHj9B/IaJ+pAkSZIkSfPPUBbxlCRJkiRJuj8sYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJWuMlWZrkiiQrkhw2wfHXJ7ksycVJzkryiFHkKUnSmswChiRJWqMlWRt4H7AXsB2wb5LtxjW7EFhSVU8ETgP+dbhZSpIkCxiSJGlNtxOwoqquqqq7gZOBvQcbVNXXquqOtnsOsOWQc5QkaY23zqgTkCRpTfPWt7511Cmslvma9zRsAVw9sL8S2HmK9gcAX5zoQJKDgIMAtt5669nKT5Ik4QwMSZKkaUvy18AS4N8mOl5Vx1fVkqpasmjRouEmJ0nSAjenBYwkJyS5LskPBmKbJDkzyZXtfeMWT5Jj2+JZFyfZYeCc/Vv7K5PsPxDfMckl7Zxjk2SqPiRJkiZwDbDVwP6WLXYfSZ4J/CPwvKq6a0i5SZKkZq5nYHwUWDoudhhwVlUtBs5q+9AtnLW4vQ4CjoOuGAEcQTeVcyfgiIGCxHHAgQPnLV1FH5IkSeOdDyxOsm2S9YB9gGWDDZI8GfgAXfHiuhHkKEnSGm9OCxhV9Q3gxnHhvYET2/aJwPMH4idV5xxgoySbA3sCZ1bVjVV1E3AmsLQde2hVnVNVBZw07loT9SFJknQfVXUPcAhwBnA5cEpVXZrkyCTPa83+DXgwcGqSi5Ism+RykiRpjoxiEc/Nquratv1LYLO2PdECWlusIr5ygvhUfUiSJP2RqjodOH1c7C0D288celKSJOk+RrqIZ5s5UaPsI8lBSZYnWX799dfPZSqSJEmSJGk1jaKA8at2+wftfew+0skW0JoqvuUE8an6+COuFi5JkiRJUv+NooCxDBh7ksj+wGcH4vu1p5HsAtzSbgM5A9gjycZt8c49gDPasVuT7NKePrLfuGtN1IckSZIkSZqH5nQNjCSfBHYFNk2yku5pIu8ATklyAPAz4KWt+enAs4EVwB3AqwGq6sYkR9GtEA5wZFWNLQz6GronnWwAfLG9mKIPSZIkSZI0D81pAaOq9p3k0O4TtC3g4EmucwJwwgTx5cDjJ4jfMFEfkiRJkiRpfhrpIp6SJEmSJEnTYQFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJkiRJvWcBQ5IkSZIk9Z4FDEmSJEmS1HsWMCRJkiRJUu9ZwJAkSZIkSb1nAUOSJEmSJPWeBQxJkiRJktR7FjAkSZIkSVLvWcCQJEmS
JEm9ZwFDkiRJkiT13jqjTkDS3Pj6058x6hRWyzO+8fVRpyBJkiSph5yBIUmSJEmSem9BFzCSLE1yRZIVSQ4bdT6SJKmfVjVmSLJ+kk+14+cm2WYEaUqStEZbsLeQJFkbeB/wLGAlcH6SZVV12WgzkzSb/uMNnxt1CqvlkHc9d9QpSGqmOWY4ALipqh6VZB/gncDLhp+tJGnUTjl1p1GnMKGXvuS8Uacw5xZsAQPYCVhRVVcBJDkZ2BuwgCFJkgZNZ8ywN/DWtn0a8B9JUlU1zEQlSbo/nnTaGaNO4Y98/8V7TrttFurv3SQvBpZW1d+2/VcCO1fVIePaHQQc1HYfA1wx1ERhU+DXQ+5zFPycC4ufc2Hxcy4so/icj6iqRUPuc9ZMZ8yQ5Aetzcq2/+PW5tfjrjXMccV8/TNt3sM3X3Ofr3nD/M19vuYN8zd3857YhGOLhTwDY1qq6njg+FH1n2R5VS0ZVf/D4udcWPycC4ufc2FZUz5nXw1zXDFf/1ub9/DN19zna94wf3Ofr3nD/M3dvGdmIS/ieQ2w1cD+li0mSZI0aDpjhj+0SbIOsCFww1CykyRJwMIuYJwPLE6ybZL1gH2AZSPOSZIk9c90xgzLgP3b9ouBr7r+hSRJw7VgbyGpqnuSHAKcAawNnFBVl444rYmM7PaVIfNzLix+zoXFz7mwrCmfc9ZMNmZIciSwvKqWAR8GPpZkBXAjXZFj1Obrf2vzHr75mvt8zRvmb+7zNW+Yv7mb9wws2EU8JUmSJEnSwrGQbyGRJEmSJEkLhAUMSZIkSZLUexYwJEmSJElS7y3YRTw1Wkl2Aqqqzk+yHbAU+GFVnT7i1CRJklbJsYwWuoGnLv2iqr6S5OXAU4HLgeOr6ncjTVCagIt4DlmSxwJbAOdW1W0D8aVV9aXRZTZ7khwB7EVXIDsT2Bn4GvAs4IyqOnqE6c2ZJH8B7AT8oKq+POp8JEnqg/k49llIY5n5Mj5JsjNweVXdmmQD4DBgB+Ay4O1VdctIE5xCktcCn6mqq0edy0wk+Tjdn/EHAjcDDwY+DexO9+/E/Sc/e7SS/BnwQmAr4F7gR8AnqurWkSamOWcBY4jaX24H01U1twcOrarPtmPfq6odRpjerElyCd3nWx/4JbDlwC+jc6vqiaPMb7YkOa+qdmrbB9L9t/0MsAfwuap6xyjzk7TwJdkQOBx4PvAnQAHXAZ8F3lFVN48sOQ1NkldX1UdGncdE5uvYZz6PZebr+CTJpcCT2mONjwfuAE6j+8f0k6rqhSNNcApJbgFuB34MfBI4taquH21Wq5bk4qp6YpJ1gGuAh1fVvUkCfL+vf87b3yvPAb4BPBu4kK4A8wLgNVV19siS05xzDYzhOhDYsaqeD+wK/P+SHNqOZVRJzYF7qureqroD+PFYJbSqfgv8frSpzap1B7YPAp5VVf9MN0B4xWhSmn1JHprkX5J8rE0tHDz2/lHlNduS/GmS45K8L8nDkrw1ySVJTkmy+ajzmy1Jvpfkn5I8ctS5zKUkD05yZJJLk9yS5Pok5yR51ahzm2WnADcBu1bVJlX1MOAvW+yUkWamYfrnUScwhfk69pnPY5n5Oj5Zq6ruadtLqup1VfWtlvufjTKxabgK2BI4CtgRuCzJl5Lsn+Qho01tSmu120geQjcLY8MWX5/7/jnqmwOBvarqbcAzgcdV1T/S3eZ1zEgzW4UkGyZ5R5IfJrkxyQ1JLm+xjUad3+pK8sVh9eUaGMO11tjUyar6aZJdgdOSPIJ+/xKfqbuTPLD90t9xLNi+Kez7L/2ZWCvJxnSFwIxV2qvq9iT3TH3qvPIR4Ergv4C/SfIi4OVVdRewy0gzm10fBb4APIhumvDH6ar6zwf+D7D3qBKbZRsDGwFfS/JLum+KPlVVvxhpVrPv43TfOO4JvJTuv+vJwD8leXRV/cMok5tF21TVOwcDVfVL4J1J/mZEOWkOJLl4skPAZsPMZYbm69hnPo9l5uv45AcDs4m+
n2RJVS1P8mig72sxVFX9Hvgy8OUk69LdgrQv8O/AolEmN4UPAz8E1gb+ETg1yVV047uTR5nYNKxDd+vI+nS3vlBVP28/+z47Bfgq3RcPv4TuSzRg/3ZsjxHmNqUkk82YC92MteHk4S0kw5Pkq8Drq+qigdg6wAnAK6pq7VHlNpuSrN/+cTs+vimweVVdMoK0Zl2Sn9INYkI3bftpVXVtkgcD36qq7UeY3qxJctHgZ0nyj3T/sH8ecGZfp//OVJILq+rJbfvnVbX1wLGLFtB/zz9M2U7y3+kGVy+km979yao6fpT5zZYk36+qJw3sn19V/y3JWsBlVfXYEaY3a5J8GfgKcGJV/arFNgNeRfet6zNHmJ5mUZJf0RXkbhp/CPhOVT18+Fmt2nwd+8znscx8HZ+04tB7gP8O/Jpu/Yur2+u1VfX9EaY3pcExxATHxgphvZTk4QBV9Ys2A+CZwM+r6ryRJjaFNovrAOBcuj8v76yqjyRZBPxXVT19pAlOIckVVfWYmR7rgyT3Al9n4uLzLlW1wTDycAbGcO0H3Kfy3abK7ZfkA6NJafZN9Au/xX9N9wtpQaiqbSY59Hu6e/AWivWTrNW+WaCqjk5yDd19hw8ebWqzavCWupPGHevlAPv+qqpvAt9M8nd0C9O9DFgQBQzg9iR/UVXfSvI84EaAqvp9u7d3oXgZ3UJ3X2+FiwJ+BSyjm3mihePzwIMHCwFjkpw99Gymb16OfebzWGa+jk/aIp2vSvJQYFu6f6esHCvO9tzLJjvQ5+IFdIWLge2b6dYd6bWqek+SrwB/Dryrqn7Y4tcDvS1eND9L8vdM/MVD3xeBvRz4H1V15fgDSYaWuzMwJE0pyb8CX66qr4yLLwXeW1WLR5PZ7EpyJPCvNbBCfos/im4xxBePJrPZleTkqtpn1HnMtSRPBD4ELAYuBf6mqn7Uvp3Zt6qOHWmCsyjdEx62BM6pefKEB0mS1kTt9q7D6G5N/pMWHvvi4R1VNX6WXW8keTFwSVVdMcGx51fV/x1KHhYwJK2u9Hjl+9nk51xYFtLnzDx9woMkSbqv+Tw+GWbuFjAkrbbxa0UsVH7OhWUhfc50j3p8SlXdlmQbuqm/H2vTaye9J1uSJPXLfB6fDDN318CQNKV5vPL9jPg5/Zzz1Hx9woMkSWuc+Tw+6UvuFjAkrcpmTLHy/fDTmTN+Tj/nfPSrJNuPLezYZmI8h+4JD08YaWaSJGm8+Tw+6UXuFjAkrcp8Xfl+pvycfs75aF4+4UGSpDXUfB6f9CJ318CQJEmSJEm9t9aoE5AkSZIkSVoVCxiSJEmSJKn3LGBIGrokM1roJ8muST4/V/lIkqT5zbGFtGawgCFp6KrqqaPOQZIkLRyOLaQ1gwUMSUOX5Lb2vmuSs5OcluSHST6eJO3Y0hb7HvDCgXMflOSEJOcluTDJ3i3+niRvadt7JvlGEv+OkyRpDeDYQloz+BhVSaP2ZOBxwC+AbwNPS7Ic+CCwG7AC+NRA+38EvlpVf5NkI+C8JF8BDgfOT/JN4Fjg2VX1++F9DEmS1BOOLaQFygqipFE7r6pWtgHBRcA2wGOBn1TVldU96/k/B9rvARyW5CLgbOABwNZVdQdwIHAm8B9V9eOhfQJJktQnji2kBcoZGJJG7a6B7XtZ9d9LAV5UVVdMcOwJwA3Aw2cpN0mSNP84tpAWKGdgSOqjHwLbJHlk29934NgZwN8N3M/65Pb+COANdNNG90qy8xDzlSRJ/ebYQloALGBI6p2quhM4CPhCW2jruoHDRwHrAhcnuRQ4qg04Pgy8sap+ARwAfCjJA4acuiRJ6iHHFtLCkO4WMEmSJEmSpP5yBoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIk
Seo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiasSSvSvKtgf3bkvzZKs7ZJkklWWfuM5w0h1ck+fKo+pckaaHo81ggydOSXNlyev5c9iVpuEb2DwlJC0dVPXjUOUxHVX0c+Pio85AkaaHp2VjgSOA/quo9o05kvCRvBR5VVX896lyk+cgZGJLmjVHO3pAkSaM3zbHAI4BLZ+v6Sda+v9eQNDssYEiaUpKtknw6yfVJbkjyHxO0qSSPatsbJHlXkp8luSXJt5JsMME5L0ry0ySPn6LvsammByT5OfDVFv+bJJcnuSnJGUkeMXDOHkmuaH2/P8nXk/xtOzZ+uutTk5zf2p6f5KkDx85OclSSbyf5TZIvJ9l0NX+MkiTNW/NpLJDkx8CfAZ9rt5Csn2TDJB9Ocm2Sa5K8bawo0cYG305yTJIbgLcm+WiS45KcnuR24C+TPDzJf7WfwU+SvHYgx7cmOS3Jfya5FXjVJJ9lKfAPwMtabt9P8pIkF4xr9/okn23bH03yf5Kc2cYjXx837nlsO3ZjG/+8dLKfpbQQWMCQNKn2y/3zwM+AbYAtgJNXcdq/AzsCTwU2Af4e+P24674aeCfwzKr6wTRSeQbw58CeSfam++X/QmAR8E3gk+26mwKnAYcDDwOuaHlM9Nk2Ab4AHNvavhv4QpKHDTR7OfBq4E+A9YA3TiNXSZIWjPk2FqiqRwI/B55bVQ+uqruAjwL3AI8CngzsAfztwLV3Bq4CNgOObrGXt+2HAN8BPgd8v33+3YHXJdlz4Bp7041BNmKS21Wr6kvA24FPtdyeBCwDtk3y5wNNXwmcNLD/CuAoYFPgorHrJ3kQcCbwCbqxyj7A+5NsN+FPUFoALGBImspOwMOBN1XV7VV1Z1V9a7LGSdYC/gY4tKquqap7q+o7bfAw5nXAm4Bdq2rFNPN4a+v/t8D/BP6lqi6vqnvoBgLbt28jng1cWlWfbseOBX45yTX/Criyqj5WVfdU1SeBHwLPHWjzkar6Uev3FGD7aeYrSdJCMd/GAuPz2YxufPC6dv51wDF0/9gf84uqem8bD/y2xT5bVd+uqt8DTwAWVdWRVXV3VV0FfHDcNb5bVf+3qn4/cI1Vaj+XTwF/3fJ9HF2h6PMDzb5QVd9obf8ReEqSrYDnAD+tqo+03C8E/gt4yXT7l+Yb
CxiSprIV8LM2OJiOTYEHAD+eos2bgPdV1coZ5HH1wPYjgPckuTnJzcCNQOi+EXn4YNuqKmCyfh5O923SoJ+164wZLH7cAfRpgTJJkoZhvo0FxnsEsC5w7UD7D9DNWJjo2pP19/Cx89s1/oFuxsZU15iuE4GXJwnd7ItTxhV8Bsc2t9F93oe3vHYel9crgD+9H7lIveYCM5KmcjWwdZJ1pjlw+TVwJ/BIummWE9kD+FKSX1bVf00zjxqX09HtiSL3kWQxsOXAfgb3x/kF3S/+QVsDX5pmTpIkrQnm1VhgAlcDdwGbTpF/rSJ2NfCTqlo8zfym8kftquqcJHcD/53u1pWXj2uy1dhGkgfT3Zbzi5bX16vqWdPsW5r3nIEhaSrnAdcC70jyoCQPSPK0yRq3aZYnAO9ui12tneQpSdYfaHYpsBR4X5LnrUZO/wc4vE2xpC3MNTZV8gvAE5I8P90K4Acz+bcQpwOPTvLyJOskeRmwHfedsilJ0ppuvo0FxudzLfBl4F1JHppkrSSPTPKMGfR3HvCbJG9Ot0Dp2kken+S/rUbuvwK2abfaDDoJ+A/gdxPcovPsJH+RZD26tTDOqaqr6cYsj07yyiTrttd/G7eehrSgWMCQNKmqupduTYhH0S2ItRJ42SpOeyNwCXA+3RTHdzLu75qq+j7dfZsfTLLXDHP6TLvmyW2l7x8Ae7Vjv6a77/NfgRvoChLL6b55GX+dG1oOb2ht/x54TruGJEli/o0FJrEf3WLclwE30S22ufkM+ru35bo98BO6WSYfAjacSd7Nqe39hiTfG4h/DHg88J8TnPMJ4Ai6n+WOtPUyquo3dLNZ9qGbkfFLup/L+hNcQ1oQ0t0iLkkLT/t2YyXwiqr62qjzkSRJmki6x8xeB+xQVVcOxD8KrKyqfxpVblKfOAND0oKSZM8kG7Wpqv9At6jXOSNOS5IkaSr/Czh/sHgh6Y9ZwJA0UklekeS2CV6XruYln0K38vmv6aa8Pn8mjzOTJEnDNQdjgZFK8sVJPs8/TNL+p8ChdLe1SpqCt5BIkiRJkqTecwaGJEmSJEnqvXVGnUCfbLrpprXNNtuMOg1JkuaVCy644NdVtWjUefSN4wpJklbPZGMLCxgDttlmG5YvXz7qNCRJmleS/GzUOfSR4wpJklbPZGMLbyGRJEmSJEm9ZwFDkiRJkiT1ngUMSZIkSZLUexYwJEmSJElS71nAkCRJ80qSjZKcluSHSS5P8pQkmyQ5M8mV7X3j1jZJjk2yIsnFSXYYuM7+rf2VSfYfiO+Y5JJ2zrFJ0uIT9iFJkoZjTgsYSR6Q5Lwk309yaZJ/bvFtk5zbBgafSrJei6/f9le049sMXOvwFr8iyZ4D8aUttiLJYQPxCfuQJEnz3nuAL1XVY4EnAZcDhwFnVdVi4Ky2D7AXsLi9DgKOg64YARwB7AzsBBwxUJA4Djhw4LylLT5ZH5IkaQjm+jGqdwG7VdVtSdYFvpXki8DrgWOq6uQk/wc4gG6wcABwU1U9Ksk+wDuBlyXZDtgHeBzwcOArSR7d+ngf8CxgJXB+kmVVdVk7d6I+JElrmFNO3WnUKfyRl77kvFGnMC8l2RB4OvAqgKq6G7g7yd7Arq3ZicDZwJuBvYGTqqqAc9rsjc1b2zOr6sZ23TOBpUnOBh5aVee0+EnA84EvtmtN1IdW08+PfMKoU5h1W7/lklGnIEkL1pzOwKjObW133fYqYDfgtBY/kW5gAN3A4MS2fRqwe5u2uTdwclXdVVU/AVbQfVuyE7Ciqq5qA5iTgb3bOZP1IUmS5q9tgeuBjyS5MMmHkjwI2Kyqrm1tfgls1ra3AK4eOH9li00VXzlBnCn6kCRJQzDna2AkWTvJRcB1wJnAj4Gbq+qe1mRwYPCHwUQ7fgvwMGY++HjYFH2Mz++gJMuTLL/++uvvxyeVJElDsA6wA3BcVT0ZuJ1xt3K02RY1l0lM1ofjCkmS5s6cFzCq6t6q2h7Ykm7GxGPnus+Z
qKrjq2pJVS1ZtGjRqNORJElTWwmsrKpz2/5pdAWNX7VbQ2jv17Xj1wBbDZy/ZYtNFd9ygjhT9PEHjiskSZo7Q3sKSVXdDHwNeAqwUZKx9TcGBwZ/GEy04xsCNzDzwccNU/QhSZLmqar6JXB1kse00O7AZcAyYOxJIvsDn23by4D92tNIdgFuabeBnAHskWTjtnjnHsAZ7ditSXZpt6TuN+5aE/UhSZKGYK6fQrIoyUZtewO6xTYvpytkvLg1Gz/IGBsYvBj4apuiuQzYpz2lZFu6FcHPA84HFrcnjqxHt9DnsnbOZH1IkqT57e+Ajye5GNgeeDvwDuBZSa4Entn2AU4HrqJbP+uDwGsA2uKdR9GNJc4Hjhxb0LO1+VA758d0C3gyRR+SJGkI5vopJJsDJyZZm65YckpVfT7JZcDJSd4GXAh8uLX/MPCxJCuAG+kKElTVpUlOofuG5R7g4Kq6FyDJIXTfoqwNnFBVl7ZrvXmSPiRJ0jxWVRcBSyY4tPsEbQs4eJLrnACcMEF8OfD4CeI3TNSHJEkajjktYFTVxcCTJ4hfRbcexvj4ncBLJrnW0cDRE8RPp/t2ZVp9SJIkSZKk+Wdoa2BIkiRJkiStLgsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkSZIkqfcsYEiSJEmSpN6zgCFJkiRJknrPAoYkSZIkSeo9CxiSJEmSJKn3LGBIkiRJkqTes4AhSZIkSZJ6zwKGJEmSJEnqPQsYkiRJkiSp9yxgSJIkSZKk3rOAIUmSJEmSes8ChiRJkiRJ6j0LGJIkaV5J8tMklyS5KMnyFtskyZlJrmzvG7d4khybZEWSi5PsMHCd/Vv7K5PsPxDfsV1/RTs3U/UhSZKGY04LGEm2SvK1JJcluTTJoS3+1iTXtIHHRUmePXDO4W3AcEWSPQfiS1tsRZLDBuLbJjm3xT+VZL0WX7/tr2jHt5nLzypJkobqL6tq+6pa0vYPA86qqsXAWW0fYC9gcXsdBBwHXTECOALYGdgJOGKgIHEccODAeUtX0YckSRqCuZ6BcQ/whqraDtgFODjJdu3YMW3gsX1VnQ7Qju0DPI5usPD+JGsnWRt4H90gZDtg34HrvLNd61HATcABLX4AcFOLH9PaSZKkhWlv4MS2fSLw/IH4SdU5B9goyebAnsCZVXVjVd0EnAksbcceWlXnVFUBJ4271kR9SJKkIZjTAkZVXVtV32vbvwEuB7aY4pS9gZOr6q6q+gmwgu5bkZ2AFVV1VVXdDZwM7N2mdO4GnNbOHz9gGRtknAbsPjYFVJIkzWsFfDnJBUkOarHNquratv1LYLO2vQVw9cC5K1tsqvjKCeJT9fEHSQ5KsjzJ8uuvv361PpwkSZrY0NbAaLdwPBk4t4UOafeinjAwZXOmg4yHATdX1T3j4ve5Vjt+S2s/Pi8HGpIkzS9/UVU70M3MPDjJ0wcPtpkTNZcJTNZHVR1fVUuqasmiRYvmMgVJktY4QylgJHkw8F/A66rqVrp7Sx8JbA9cC7xrGHlMxIGGJEnzS1Vd096vAz5DN1PzV+32D9r7da35NcBWA6dv2WJTxbecIM4UfUiSpCGY8wJGknXpihcfr6pPA1TVr6rq3qr6PfBBuoEHzHyQcQPdvazrjIvf51rt+IatvSRJmqeSPCjJQ8a2gT2AHwDLgLEniewPfLZtLwP2a08j2QW4pd0GcgawR5KN20zQPYAz2rFbk+zSbj3db9y1JupDkiQNwVw/hSTAh4HLq+rdA/HNB5q9gG7gAd3AYJ/2BJFt6Vb+Pg84H1jcnjiyHt1Cn8va9M2vAS9u548fsIwNMl4MfLW1lyRJ89dmwLeSfJ9ujPCFqvoS8A7gWUmuBJ7Z9gFOB66iW1frg8BrAKrqRuAoujHG+cCRLUZr86F2zo+BL7b4ZH1IkqQhWGfVTe6XpwGvBC5JclGL/QPdU0S2p7t39KfA/wCoqkuTnAJcRvcEk4Or
6l6AJIfQfVuyNnBCVV3arvdm4OQkbwMupCuY0N4/lmQFcCNd0UOSJM1jVXUV8KQJ4jcAu08QL+DgSa51AnDCBPHlwOOn24ckSRqOOS1gVNW3gIme/HH6FOccDRw9Qfz0ic5rA5mdJojfCbxkJvlKkqThSPJoujWxNquqxyd5IvC8qnrbiFOTJEk9NbSnkEiSJA34IHA48DuAqroYZ0tKkqQpWMCQJEmj8MCqOm9c7J4JW0qSJGEBQ5IkjcavkzySbj0skryY7tHqkiRJE5rrRTwlSZImcjBwPPDYJNcAPwFeMdqUJElSn1nAkCRJQ9cW4X5mkgcBa1XVb0adkyRJ6jdvIZEkSUOX5GFJjgW+CZyd5D1JHjbqvCRJUn9ZwJAkSaNwMnA98CLgxW37UyPNSJIk9Zq3kEiSpFHYvKqOGth/W5KXjSwbSZLUe87AkCRJo/DlJPskWau9XgqcMeqkJElSf1nAkCRJo3Ag8AngrvY6GfgfSX6T5NaRZiZJknrJW0gkSdLQVdVDRp2DJEmaX5yBIUmShi7JfyV5dhLHIpIkaVocNEiSpFE4DngFcGWSdyR5zKgTkiRJ/WYBQ5IkDV1VfaWqXgHsAPwU+EqS7yR5dZJ1R5udJEnqIwsYkiRpJJI8DHgV8LfAhcB76AoaZ44wLUmS1FMzWsQzyRbAIwbPq6pvzHZSkiRpYUvyGeAxwMeA51bVte3Qp5IsH11mw7Hjm04adQqz7oJ/22/UKUiSFrhpFzCSvBN4GXAZcG8LF2ABQ5IkzdQHq+r0wUCS9avqrqpaMqqkJElSf83kFpLnA4+pqmdX1XPb63lTnZBkqyRfS3JZkkuTHNrimyQ5M8mV7X3jFk+SY5OsSHJxkh0GrrV/a39lkv0H4jsmuaSdc2ySTNWHJEnqhbdNEPvudE5MsnaSC5N8vu1vm+TcNhb4VJL1Wnz9tr+iHd9m4BqHt/gVSfYciC9tsRVJDhuIT9iHJEkanpkUMK4CZrqo1j3AG6pqO2AX4OAk2wGHAWdV1WLgrLYPsBewuL0OoluhnCSbAEcAOwM7AUcMFCSOAw4cOG9pi0/WhyRJGpEkf5pkR2CDJE9OskN77Qo8cJqXORS4fGD/ncAxVfUo4CbggBY/ALipxY9p7WhjkX2Ax9GNG97fiiJrA++jG49sB+zb2k7VhyRJGpKZFDDuAC5K8oE20+HYJMdOdUJVXVtV32vbv6EbbGwB7A2c2JqdSDe7gxY/qTrnABsl2RzYEzizqm6sqpvoFvda2o49tKrOqaoCThp3rYn6kCRJo7Mn8O/AlsC7Bl7/H/APqzo5yZbAXwEfavsBdgNOa03GjyvGxgKnAbu39nsDJ7fbVX4CrKD7gmQnYEVVXVVVdwMnA3uvog9JkjQkM1nEc1l7rZY2bfPJwLnAZgOLdf0S2KxtbwFcPXDayhabKr5ygjhT9DE+r4PoZnuw9dZbz/RjSZKkGaiqE4ETk7yoqv5rsnZJ9m9tx/vfwN8DD2n7DwNurqp72v7gWOAP44equifJLa39FsA5A9ccPGf8eGPnVfQxPm/HFZIkzZFpz8Bog4hPAhe01ycmGVj8kSQPBv4LeF1V3TruukW3GOicmaqPqjq+qpZU1ZJFixbNZRqSJKmZqnjRHDo+kOQ5wHVVdcHcZHX/Oa6QJGnuTLuA0e5NvZLu3tD3Az9K8vRpnLcuXfHi41X16Rb+Vbv9g/Z+XYtfA2w1cPqWLTZVfMsJ4lP1IUmS+i8TxJ4GPC/JT+lu79gNeA/dLadjs0oHxwJ/GD+04xsCNzDz8cYNU/QhSZKGZCZrYLwL2KOqnlFVT6e7h/WYqU5o94x+GLi8qt49cGgZMPYkkf2Bzw7E92tPI9kFuKXdBnIGsEeSjdvinXsAZ7RjtybZpfW137hrTdSHJEnqvz+aOVlVh1fVllW1Dd0inF+tqlcAXwNe3JqNH1eMjQVe3NpXi+/TnlKyLd0i4OcB5wOL2xNH1mt9LGvnTNaHJEka
kpmsgbFuVV0xtlNVP2qzK6byNOCVwCVJLmqxfwDeAZyS5ADgZ8BL27HTgWfTLaZ1B/Dq1teNSY6iG1gAHFlVN7bt1wAfBTYAvtheTNGHJEnqv4lmYEzmzcDJSd4GXEj35Qnt/WNJVgA30hUkqKpLk5wCXEb3xLSDq+pegCSH0H1xsjZwQlVduoo+JEnSkMykgLE8yYeA/2z7rwCWT3VCVX2LyQcgu0/QvoCDJ7nWCcAJE8SXA4+fIH7DRH1IkqTRSrIW8OKqOmWKZt+e6hpVdTZwdtu+iu4JIuPb3Am8ZJLzjwaOniB+Ot0XKuPjE/YhSZKGZya3kPwvum8qXttel7WYJEnStFXV7+meJDJVm0OGlI4kSZonpj0Do6ruAt7dXpIkSffHV5K8EfgUcPtYcOAWUUmSpPtYZQEjySlV9dIklzDxglpPnJPMJEnSQvay9j5462gBfzaCXCRJ0jwwnRkYY89hf85cJiJJktYcVbXtqHOQJEnzyyoLGO1RpVTVz+Y+HUmStCZI8kDg9cDWVXVQksXAY6rq8yNOTZIk9dS0F/FM8pskt457XZ3kM0mc7ilJkmbiI8DdwFPb/jXA20aXjiRJ6ruZPEb1fwMrgU/QPRp1H+CRwPfoHm+66yznJkmSFq5HVtXLkuwLUFV3JJns0euSJEkzeozq86rqA1X1m6q6taqOB/asqk8BG89RfpIkaWG6O8kGtAXCkzwSuGu0KUmSpD6bSQHjjiQvTbJWe70UuLMd+6Onk0iSJE3hCOBLwFZJPg6cBfz9aFOSJEl9NpNbSF4BvAd4P13B4hzgr9u3J4fMQW6SJGkBSrIW3ezNFwK70N2aemhV/XqkiUmSpF6bdgGjqq4CnjvJ4W/NTjqSJGmhq6rfJ/n7qjoF+MKo85EkSfPDTJ5C8ugkZyX5Qdt/YpJ/mrvUJEnSAvaVJG9MslWSTcZeo05KkiT110xuIfkg8CbgAwBVdXGST+AjzyRpRi4/+qujTuGP/Pk/7jbqFLTmeVl7P3ggVoCPZpckSROaSQHjgVV13rgnnN0zy/lIkqQFrq2BcVh7kpkkSdK0zOQpJL9ujzgbe9zZi4Fr5yQrSZK0YFXV7+lmdUqSJE3bTGZgHAwcDzw2yTXAT+ieTCJJkjRTX0nyRuBTwO1jwaq6cXQpSZKkPptWASPJ2sBrquqZSR4ErFVVv5nb1CRJ0gLmGhiSJGlGpnULSVXdC/xF277d4oUkSbo/qmrbCV6rLF4keUCS85J8P8mlSf65xbdNcm6SFUk+lWS9Fl+/7a9ox7cZuNbhLX5Fkj0H4ktbbEWSwwbiE/YhSZKGYyZrYFyYZFmSVyZ54dhrqhOSnJDkurFHr7bYW5Nck+Si9nr2wLFZGUhMNViRJEmjl+SBSf4pyfFtf3GS50zj1LuA3arqScD2wNIkuwDvBI6pqkcBNwEHtPYHADe1+DGtHUm2A/YBHgcsBd6fZO026/R9wF7AdsC+rS1T9CFJkoZgJgWMBwA3ALsBz22vVQ00Pko3KBjvmKravr1Oh1kfSEw4WJEkSb3xEeBu4Klt/xqm8Wj26tzWdtdtr6Ibn5zW4icCz2/be7d92vHd0z1SbW/g5Kq6q6p+AqwAdmqvFVV1VVXdDZwM7N3OmawPSZI0BNMuYFTVqyd4/c3Y8SSHT3DON4DpLsY1mwOJyQYrkiSpHx5ZVf8K/A6gqu4ApvW7un3BcRFwHXAm8GPg5qoae7z7SmCLtr0FcHXr4x7gFuBhg/Fx50wWf9gUfQzmdlCS5UmWX3/99dP5OJIkaZpmMgNjVV4yg7aHJLm43WKycYvN5kBissHKH3GgIUnSSNydZAP+3+PZH0l3e8gqVdW9VbU9sCXdFx2PnaskZ6qqjq+qJVW1ZNGiRaNOR5KkBWU2CxjTneFwHPBIuvtWrwXeNYs5zJgDDUmSRuII4EvAVkk+DpwF/P1MLlBVNwNfA54CbJRk7OlqW9LdkkJ73wqgHd+Q7pbYP8THnTNZ
/IYp+pAkSUMwmwWMmlajql+1b05+D3yQ7psTmN2BxGSDFUmSNEJJntY2vwG8EHgV8ElgSVWdPY3zFyXZqG1vADwLuJyukPHi1mx/4LNte1nbpx3/alVVi+/TFv7eFlgMnAecDyxuC4WvR7c+17J2zmR9SJKkIRj6DIwkmw/svgAYe0LJbA4kJhusSJKk0Tq2vX+3qm6oqi9U1eer6tfTPH9z4GtJLqYbI5xZVZ8H3gy8PskKuttGP9zafxh4WIu/HjgMoKouBU4BLqObCXJw+4LlHuAQ4Ay6wsgprS1T9CFJkoZgnVU36STZpKpuHBfbti24CXDqBOd8EtgV2DTJSrrporsm2Z5uxsZPgf8B3UAiydhA4h7aQKJdZ2wgsTZwwriBxMlJ3gZcyH0HKx9rA4wb6YoekiRp9H7XHp26ZZJjxx+sqtdOdXJVXQw8eYL4Vfy/WZ2D8TuZZJ2uqjoaOHqC+OnA6dPtQ5IkDce0CxjA55LsVVW3wh8ee3oK8HiAqnr7+BOqat8JrjPptxWzNZCYarAiSZJG6jnAM4E9gQtGnIskSZpHZlLAeDtdEeOvgMcAJwGvmJOsJEnSgtRuFTk5yeVV9f1R5yNJkuaPaRcwquoLSdYFvgw8BHhBVf1ozjKTJEkL2W+TnAVsVlWPT/JE4HlV9bZRJyZJkvpplQWMJO/lvk8Y2RD4MXBIklXeqypJkjSBDwJvAj4A3doWST4BWMCQJEkTms4MjOXj9r1fVZIk3V8PrKrzkvs8xOyeUSUjSZL6b5UFjKo6ESDJg4A7B54Msjaw/tymJ0mSFqhfJ3kkbZZnkhcD1442JUmS1GdrzaDtWcAGA/sbAF+Z3XQkSdIa4mC620cem+Qa4HXA/xxpRpIkqddm8hSSB1TVbWM7VXVbkgfOQU7SvPT1pz9j1Cn8kWd84+ujTkGS/kibxfmaqnpmm+G5VlX9ZtR5SZKkfpvJDIzbk+wwtpNkR+C3s5+SJElayNrtqH/Rtm+3eCFJkqZjJjMwXgecmuQXQIA/BV42F0lJkqQF78Iky4BTgdvHglX16dGlJEmS+mzaBYyqOj/JY4HHtNAVVfW7uUlLkiQtcA8AbgB2G4gVYAFDkiRNaJUFjCS7VdVXk7xw3KFHJ/GbEkmSNGNV9eqpjic5vKr+ZVj5SJKk/pvODIxnAF8FnjvBMb8pkSRJc+ElgAUMSZL0B6ssYFTVEe19ym9KJEmSZlFGnYAkSeqX6dxC8vqpjlfVu2cvHUmSJKCb5SlJkvQH07mF5CFTHHNwIUmS5oIzMCRJ0n1M5xaSfwZIciJwaFXd3PY3Bt41p9lJkqQFKckmVXXjuNi2VfWTtnvqCNKSJEk9ttYM2j5xrHgBUFU3AU+e9YwkSdKa4HNJHjq2k2Q74HNj+1X19pFkJUmSemsmBYy12qwLoPvmhFXM4EhyQpLrkvxg8LwkZya5sr1v3OJJcmySFUkuTrLDwDn7t/ZXJtl/IL5jkkvaOccmyVR9SJKk3ng7XRHjwUl2pJtx8derOinJVkm+luSyJJcmObTFHV9IkrTAzaSA8S7gu0mOSnIU8B3gX1dxzkeBpeNihwFnVdVi4Ky2D7AXsLi9DgKOgz8USo4AdgZ2Ao4YGDAcBxw4cN7SVfQhSZJ6oKq+ABwDfJluvPCCqrpoGqfeA7yhqrYDdgEObrM3HF9IkrTATbuAUVUnAS8EftVeL6yqj63inG8AN44L7w2c2LZPBJ4/ED+pOucAGyXZHNgTOLOqbmy3rZwJLG3HHlpV51RVASeNu9ZEfUiSpBFK8t42q+FYYDdgQ+AnwCEtNqWquraqvte2fwNcDmyB4wtJkha86TyF5A+q6jLgsvvZ52ZVdW3b/iWwWdveArh6oN3KFpsqvnKC+FR9/JEkB9F9I8PWW289088iSZJmZvm4/QtW90JJtqFbj+tcejK+cFwhSdLcmVEBY7ZVVSWZ00exrqqPqjoeOB5gyZIlPhZWkqQ5VFUnAiR5EHBn
Vd3b9tcG1p/udZI8GPj/t3fn4ZZU9b3/3x/BAUdACEGBgAYTiQNKB7ghURSFhqioQYMmgoRIcoXEJMYbHJ6AGnIhBv2JGnJROkAcEIiG1mBaggNxQLpBZJTQokgTpgCCiBPw/f1R6+jmcM7pbuizq/bp9+t59rNrf2tV1XcfmlN1vrVqrX8B/qyq7mjDVEwdo7frC68rJEmaP2szBsa6cmPrnkl7v6nFrwO2Hmm3VYvNFd9qhvhcx5AkScNwDrDRyOeNgP9Ykw2TPJSuePGRqvpEC3t9IUnSAtdHAWMpMDXS94HAmSPxA9po4bsCt7dumsuAPZNs0gbX2hNY1tbdkWTXNjr4AdP2NdMxJEnSMDyiqu6c+tCWH7m6jdo5/0Tgiqp698gqry8kSVrg5vURkiQfA3YHNkuyim6076OB05IcDFwDvLI1PwvYB1gJ3AUcBFBVt7ZZT5a3du+oqqmBQV9PN3L5RsBn2os5jiFJkobhB0mePTUgZ5tK9YdrsN1uwGuAS5Jc1GJvwesLSZIWvHktYFTVq2ZZtccMbQs4dJb9LAGWzBBfATxthvgtMx1DkiQNxp8Bpyf5byDALwK/u7qNqupLrf1MvL6QJGkB63UQT0mStH6qquVJfhX4lRa6sqp+2mdOkiRp2CxgSJKksUny/Kr6XJKXT1v1lCSMDMopSZJ0HxYwJEnSOD0X+Bzw4hnWFWABQ5IkzcgChiRJGpuqOqK9H9R3LpIkabJYwJAkSWOT5C/mWj9talRJkqSfsYAhSZLG6TFzrKuxZSFJkiaOBQxJkjQ2VfV2gCQnA2+oqu+1z5sAx/aYmiRJGriH9J2AJElaLz1jqngBUFW3Ac/qLx1JkjR0FjAkSVIfHtJ6XQCQZFPsGSpJkubghYIkSerDscBXk5zePr8COKrHfCRJ0sBZwJAkSWNXVackWQE8v4VeXlWX95mTJEkaNgsYkiSpF61gYdFCkiStEcfAkCRJkiRJg2cPjAXqu+94et8p3M82f31J3ylIkiRJkiaUPTAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNnmNgSOu597/xU32ncD+HHfvivlOQNFBJlgAvAm6qqqe12KbAx4Ftge8Ar6yq25IEeC+wD3AX8NqqurBtcyDwtrbbv6mqk1t8J+AkYCPgLOANVVWzHWOev64kSRrRWw+MJN9JckmSi9o88CTZNMnZSa5q75u0eJIcl2RlkouTPHtkPwe29le1i5Gp+E5t/yvbthn/t5QkSevYScDiabHDgXOqanvgnPYZYG9g+/Y6BDgeflbwOALYBdgZOGLqmqO1ed3IdotXcwxJkjQmfT9C8ryq2rGqFrXP47gAkSRJE6qqzgVunRbeFzi5LZ8MvHQkfkp1zgM2TrIlsBdwdlXd2npRnA0sbuseW1XnVVUBp0zb10zHkCRJY9J3AWO6cVyASJKkhWWLqrq+Ld8AbNGWnwhcO9JuVYvNFV81Q3yuY0iSpDHps4BRwGeTXJDkkBYbxwXIfSQ5JMmKJCtuvvnmB/N9JElSz9qNi+rrGF5XSJI0f/osYPxmVT2b7vGQQ5M8Z3TlOC5A2nFOqKpFVbVo8803n+/DSZKkde/G1vuS9n5Ti18HbD3SbqsWmyu+1QzxuY5xH15XSJI0f3orYFTVde39JuCTdGNYjOMCRJIkLSxLgamBvA8EzhyJH9AGA98VuL319FwG7JlkkzZ21p7AsrbujiS7tsG/D5i2r5mOIUmSxqSXaVSTPAp4SFV9vy3vCbyDn18cHM39L0AOS3Iq3YCdt1fV9UmWAX87MnDnnsCbq+rWJHe0i5Wv0V2AvG9c30+SpHXhmWcs6zuF+/nGfnv1evwkHwN2BzZLsopuMO+jgdOSHAxcA7yyNT+LbgrVlXTTqB4E0K4T3gksb+3eUVVTA4O+np9Po/qZ9mKOY0iSpDHppYBBN7bFJ9vMphsCH62qf0+ynPm/AJG0QBz1+/v1ncL9vPXDZ/Sdwrw58sgj
+05hRkPNS/Ojql41y6o9ZmhbwKGz7GcJsGSG+ArgaTPEb5npGJIkaXx6KWBU1dXAM2eIz3hxsC4vQNbGTm865cFsPm8ueNcBfacgSZIkSdJYDW0aVUmSJEmSpPuxgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnw+pqFRJrRbu/bre8U7ufLf/LlvlOQJEmSpPWePTAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNngUMSZIkSZI0eA7iKUmSJEkD8v43fqrvFNa5w459cd8paAGwB4YkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBcwwMSZIkSZIG7IqjPtd3CuvcU9/6/LXexh4YkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBm9Bj4GRZDHwXmAD4ENVdXTPKUmSpAnmtYUkjddRv79f3ymsc2/98Bl9pzCxFmwBI8kGwAeAFwKrgOVJllbV5f1mJkmSJpHXFtL8++Jzntt3Cuvcc8/9Yt8pSAvGQn6EZGdgZVVdXVU/AU4F9u05J0mSNLm8tpAkqUepqr5zmBdJ9gMWV9Ufts+vAXapqsOmtTsEOKR9/BXgynlKaTPgf+Zp3/NpUvOGyc19UvOGyc19UvOGyc3dvMdvPnP/parafJ72PRhrcm0xxuuKtTHJ/27XJX8OHX8OP+fPouPPoePPoTOUn8OM1xYL9hGSNVVVJwAnzPdxkqyoqkXzfZx1bVLzhsnNfVLzhsnNfVLzhsnN3bzHb5JznyTjuq5YG/637/hz6Phz+Dl/Fh1/Dh1/Dp2h/xwW8iMk1wFbj3zeqsUkSZIeCK8tJEnq0UIuYCwHtk+yXZKHAfsDS3vOSZIkTS6vLSRJ6tGCfYSkqu5OchiwjG6qsyVVdVmPKQ2qO+lamNS8YXJzn9S8YXJzn9S8YXJzN+/xm+TcB2GA1xZryv/2HX8OHX8OP+fPouPPoePPoTPon8OCHcRTkiRJkiQtHAv5ERJJkiRJkrRAWMCQJEmSJEmDZwFDkiRJkiQNngWMMUjym0n+IsmefecylyQPS3JAkhe0z69O8v4khyZ5aN/5rU6SJyX5yyTvTfLuJH+c5LF95yVJksYvya8m2SPJo6fFF/eVUx+S7Jzk19vyDu2adJ++8+pbklP6zmEIJuXvlHUtyS5Tfyck2SjJ25N8KskxSR7Xd37jkuRPk2y9+pbD4SCe8yDJ+VW1c1t+HXAo8ElgT+BTVXV0n/nNJslH6GameSTwPeDRwCeAPej+rRzYX3ZzS/KnwIuAc4F9gK/TfYeXAa+vqi/0lpyk+0nyC1V1U995SOuTJAdV1T/1ncc4tOuCQ4ErgB2BN1TVmW3dhVX17B7TG5skRwB7013fnQ3sAnweeCGwrKqO6jG9sUkyfbrjAM8DPgdQVS8Ze1I9mdS/U9a1JJcBz2yzS50A3AWcQfd3zzOr6uW9JjgmSW4HfgB8C/gYcHpV3dxvVnOzgDEPkny9qp7VlpcD+1TVzUkeBZxXVU/vN8OZJbm4qp6RZEPgOuAJVXVPkgDfqKpn9JzirJJcAuzY8n0kcFZV7Z5kG+DMqf8eml9JHl9Vt/Sdx1xaVf3NwEuBXwAKuAk4Ezi6qr7XW3IPUJLPVNXefecxmySbTg8BFwDPojsP3Tr+rFYvyeKq+ve2/Djg3cCvA5cCf15VN/aZ31ySLALeRfe7/M3AEmBn4L+AQ6rq6z2mp54k+W5VbdN3HuPQrgv+V1XdmWRbuj9M/rmq3jt6nbbQTV0fAQ8HbgC2qqo7kmwEfG3I13brUpILgcuBD9Gd90P3x9r+AFX1xf6yG69J/TtlXUtyRVU9tS3fp6iZ5KKq2rG35MYoydeBnYAXAL8LvITuGu1jwCeq6vs9pjcjHyGZHw9JskmSx9NdnN8MUFU/AO7uN7U5PSTJw4DH0PXCmOo+9XBg8I+Q0N1dgC7fRwNU1XcZcO5JfjHJ8Uk+kOTxSY5MckmS05Js2Xd+c0lydJLN2vKiJFcDX0ty
TZLn9pzeXE4DbgN2r6pNq+rxdHdhbmvrBinJs2d57UR3cTpk/0N3Mpx6rQCeCFzYlofqb0eWjwWuB14MLAf+Xy8Zrbl/AP4O+DfgK8D/q6rHAYe3dVqgklw8y+sSYIu+8xujh1TVnQBV9R1gd2DvJO+m++N1fXF3Vd1TVXcB36qqOwCq6ofAvf2mNlaL6M4/bwVubz1zf1hVX1yfihfNpP6dsq5dmuSgtvyNVvgnyVOAn/aX1thVVd1bVZ+tqoOBJ9BdJywGru43tZltuPomegAeR/dLMkAl2bKqrm/PYA75pHki8E1gA7pf8Ke3P0p3BU7tM7E18CFgeZKvAb8FHAOQZHNgkHd3m5Po/sB4FF2Xzo/QPQLzUuAfgX37SmwN/HZVHd6W3wX8blUtb7/4P0p3sTBE21bVMaOBqroBOCbJH/SU05pYDnyRmX+HbDzeVNbam+i6K7+pqi4BSPLtqtqu37TWyqKRuzHvSTLYR+qah1bVZwCSHFNVZwBU1TlJ/r7f1DTPtgD2oivKjgpdMWt9cWOSHavqIoDWE+NFdL2R1os7zM1PkjyyFTB2mgq2XmXrTQGjqu6l+919enu/kfX376BJ/TtlXftD4L1J3kZ3o+WrSa4Frm3r1hf3+W9eVT8FlgJLW6/2wfERkjFq/wi2qKpv953LbJI8AaCq/jvJxnTdib5bVef3mtgaSPJrwFOBS6vqm33nsyamdeO7T9feoXdfS3IF8PT27OB5VbXryLpLhtoFMclngf8ATp56BCDJFsBrgRdW1Qt6TG9WSS4FXlZVV82w7tqqGvQATEm2At5Dd2FwBN1jaU/qN6u5JVlF99hI6J4RfnK1k+bUI3d95jeXJF+l+zk/Dvh7uuf//7X1jjq2qoZaYNSDlORE4J+q6kszrPtoVb26h7TGrv3OubsVqKev262qvtxDWmOX5OFV9eMZ4psBW04Vldc3SX4b2K2q3tJ3LkMxCX+nzId0A3luR1fQWjXkx0PnQ5KnVNV/9Z3H2rCAIfUoyTeq6plt+W+q6m0j6wZbBABI8id03emPBp4DbEI36OvzgSdV1Wt6TG9WSTah60a/L92dygJupKs2HzPg8Rj2Ay6pqitnWPfSqvrX8We19pK8BHgLXU+YX+w7n7mkG/xu1D+054R/Efi7qjqgj7zWRJJn0j1Cci/w58D/Bg6kGxPjdVW1Pt2JlyRJC4QFDKlHSd5B94fQndPiv0w3oOR+/WS2ZpLsTveH0VPoKtfXAv8KLKmqwT5HmeRXga3oBqu6cyT+s0Ebh6jl/US6gdcmJm+4b+7APXS9GS4deu4T/jN/Kt2zrBOXuyRJ0kwsYEgDlQme7m7IuWdCp9ab1LxhcnNvvYwOY8Lyhp/9zF9PN67RjkxQ7pIkSbNZXwevkSbB24FBFgHWwJBzfx2wU41MrZdk26p6L8MevGpS84bJzf0QJjNv6H7miyY0d0laa0m+UlW/sRbtdwf+sqpeNG9JSVrnLGBIPUpy8WyrGPh0dxOc+32m1msXMGck+SWG/YfdpOYNk5v7pOYNk527JK21tSleSJpcD+k7AWk9twVwAN1gmNNft/SY15qY1NxvTLLj1If2R96LgM0Y9tR6k5o3TG7uk5o3THbukrTWktzZ3ndP8oUkZyT5ZpKPJElbt7jFLgRePrLto5IsSXJ+kq8n2bfF35vkr9vyXknOTeLfT1KPHAND6tEkT3c3qblP6tR6k5o3TG7uk5o3THbukvRAJLmzqh7depydCfwa8N/Al4E3ASuAq+hmS1sJfBx4ZFW9KMnfApdX1YeTbAycDzyLbqay5XTjIf0jsE9VfWuc30vSfVnAkCRJkjTRphUw3lpVL2zx4+mKGJcCx1XVc1r8JcAhrYCxAngEMDWD2qbAXlV1RZLfAM4F/ryq3jfWLyXpfhwDQ5IkSdJC8uOR5XtY/d88AX6nqq6cYd3T6R6NfcI6yk3Sg+AzXJIkSZIWum8C2yZ5cvv8qpF1y4A/GRkr
41nt/ZeAN9I9TrJ3kl3GmK+kGVjAkDR2Sb6ylu13T/Lp+cpHkiQtbFX1I7rpsf+tDeJ508jqdwIPBS5OchnwzlbMOJFuqtX/Bg4GPpTkEWNOXdIIx8CQNHjO1S5JkiTJHhiSxs6pziRJkiStLQfxlNS3Z3Hfqc52a6OBf5D7TnU25a3A56rqD6amOkvyH8CbgeVJ/hM4jm6qs3vH9zUkSZIkzSfvTkrq2/lVtaoVGy4CtgV+Ffh2VV1V3XNuHx5pvydweJKLgC/QTXu2TVXdBbwOOBt4v/O0S5IkSQuLPTAk9c2pziRJkiStlj0wJA2RU51JkiRJug8LGJIGx6nOJEmSJE3nNKqSJEmSJGnw7IEhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJIkSZKkwbOAIUmSJEmSBs8ChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoakiZbkC0n+sO88JEmSJM0vCxiSJEmSBinJa5N8aeTznUmetJpttk1SSTac/wwljZMFDEm98uJCkiStqap6dFVd3Xceo1qx5Jf7zkNaH1jAkDR2Sb6T5K+SXAz8IMnbknwryfeTXJ7kZSNtX5vkS0n+PsltSb6dZO9Z9rtlkouTvGlsX0aSJEnSWFjAkNSXVwG/DWwMXAn8FvA44O3Ah5NsOdJ2l9ZmM+DvgBOTZHRnSbYDvgi8v6reNe/ZS5KkdSrJ1kk+keTmJLckef8MbX7W2yHJRkmOTXJNktvbDY+NZtjmd9rNk6et5vi/meQrSb6X5Nokr23xk5J8IMm/tZstX0vy5Lbu3Lb5N9rjLb+bZLMkn277uTXJfybx7y5pHfB/JEl9Oa6qrq2qH1bV6VX131V1b1V9HLgK2Hmk7TVV9cGqugc4GdgS2GJk/Q7A54EjquqEsX0DSZK0TiTZAPg0cA2wLfBE4NTVbPb3wE7AbwCbAv8HuHfafg8CjgFeUFWXznH8XwI+A7wP2BzYEbhopMn+dDdZNgFWAkcBVNVz2vpntsdbPg68EVjV9rMF8BagVvNdJK0BCxiS+nLt1EKSA5Jc1O5UfA94Gl1viyk3TC1U1V1t8dEj638PuA44
Y/7SlSRJ82hn4AnAm6rqB1X1o6r60myNW4+GPwDeUFXXVdU9VfWVqvrxSLM/A94E7F5VK1dz/FcD/1FVH6uqn1bVLVV10cj6T1bV+VV1N/ARugLHbH5Kd7Pll9q+/rOqLGBI64AFDEl9KfjZHY8PAocBj6+qjYFLgcy+6f0cCfwP8NF2B0eSJE2Wrel6XN69hu03Ax4BfGuONm8CPlBVq9bw+HPt64aR5bu4742U6d5F10vjs0muTnL4Ghxf0hqwgCGpb4+iK2bcDD/r6jnnM6oz+CnwiravU3zOVJKkiXMtsM1azE72P8CPgCfP0WZP4G1JfmcNjz/XvtZYVX2/qt5YVU8CXgL8RZI91sW+pfWdF/mSelVVlwPHAl8FbgSeDnz5AeznJ8DL6Z41XWIRQ5KkiXI+cD1wdJJHJXlEkt1ma1xV9wJLgHcneUKSDZL8ryQPH2l2GbAY+ECSl6zm+B8BXpDklUk2TPL4JDuuYe43Ak+a+pDkRUl+uQ04fjtwD9PG5pD0wKxphVOS1pmq2nba57cCb52l7UnASdNiGVnefWT5R8AL1lmikiRpLKrqniQvBo4DvkvXO/OjwIVzbPaXwP8FltM90vENYK9p+/1GkhcB/5bkp1X1mVmO/90k+9ANDPohusLD27jvQJ6zORI4uc2AcgjdAKTvpxvE8zbgH6rq82uwH0mrEceTkSRJkiRJQ2cXa0mSJEmSNHgWMCRJkiQteEl+L8mdM7wu6zs3SWvGR0gkSZIkSdLg2QNDkiRJkiQNnrOQjNhss81q22237TsNSZImygUXXPA/VbV533kMjdcVkiQ9MLNdW1jAGLHtttuyYsWKvtOQJGmiJLmm7xyGyOsKSZIemNmuLXyERJIkSZIkDZ4FDEmSJEmSNHgWMCRJkiRJ0uDNawEjydZJPp/k8iSXJXlDi2+a5OwkV7X3TVo8SY5LsjLJxUmePbKvA1v7q5IcOBLfKcklbZvjkmSuY0iSJEmSpMkz3z0w7gbeWFU7ALsChybZATgcOKeqtgfOaZ8B9ga2b69DgOOhK0YARwC7ADsDR4wUJI4HXjey3eIWn+0YkiRJkiRpwsxrAaOqrq+qC9vy94ErgCcC+wInt2YnAy9ty/sCp1TnPGDjJFsCewFnV9WtVXUbcDawuK17bFWdV1UFnDJtXzMdQ5IkSZIkTZixTaOaZFvgWcDXgC2q6vq26gZgi7b8RODakc1Wtdhc8VUzxJnjGNPzOoSutwfbbLPNfdbt9KZT1ui7jdsF7zpgtW2++46njyGTtbPNX1/SdwqSJPVqiNcWa3JdIUnSEIxlEM8kjwb+BfizqrpjdF3rOVHzefy5jlFVJ1TVoqpatPnmm89nGpIkSZIk6QGa9wJGkofSFS8+UlWfaOEb2+MftPebWvw6YOuRzbdqsbniW80Qn+sYkiRJkiRpwsz3LCQBTgSuqKp3j6xaCkzNJHIgcOZI/IA2G8muwO3tMZBlwJ5JNmmDd+4JLGvr7kiyazvWAdP2NdMxJEmSJEnShJnvMTB2A14DXJLkohZ7C3A0cFqSg4FrgFe2dWcB+wArgbuAgwCq6tYk7wSWt3bvqKpb2/LrgZOAjYDPtBdzHEOSJEmSJE2YeS1gVNWXgMyyeo8Z2hdw6Cz7WgIsmSG+AnjaDPFbZjqGJEmSJEmaPGMZxFOSJEmSJOnBsIAhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8CxgSJKkiZFkSZKbklw6EjsyyXVJLmqvfUbWvTnJyiRXJtlrJL64xVYmOXwkvl2Sr7X4x5M8rMUf3j6vbOu3HdNXliRJjQUMSZI0SU4CFs8Qf09V7dheZwEk2QHYH/i1ts0/JNkgyQbAB4C9gR2AV7W2AMe0ff0ycBtwcIsfDNzW4u9p7SRJ0hhZwJAkSROjqs4Fbl3D5vsCp1bVj6vq28BKYOf2WllVV1fVT4BTgX2TBHg+cEbb/mTgpSP7OrktnwHs0dpLkqQxsYAhSZIW
gsOSXNweMdmkxZ4IXDvSZlWLzRZ/PPC9qrp7Wvw++2rrb2/t7yPJIUlWJFlx8803r5tvJkmSAAsYkiRp8h0PPBnYEbgeOLavRKrqhKpaVFWLNt98877SkCRpQbKAIUmSJlpV3VhV91TVvcAH6R4RAbgO2Hqk6VYtNlv8FmDjJBtOi99nX23941p7SZI0JhYwJEnSREuy5cjHlwFTM5QsBfZvM4hsB2wPnA8sB7ZvM448jG6gz6VVVcDngf3a9gcCZ47s68C2vB/wudZekiSNyYarbyJJkjQMST4G7A5slmQVcASwe5IdgQK+A/wRQFVdluQ04HLgbuDQqrqn7ecwYBmwAbCkqi5rh/gr4NQkfwN8HTixxU8E/jnJSrpBRPef328qSZKms4AhSZImRlW9aobwiTPEptofBRw1Q/ws4KwZ4lfz80dQRuM/Al6xVslKkqR1ykdIJEmSJEnS4FnAkCRJkiRJg2cBQ5IkSZIkDZ4FDEmSJEmSNHgWMCRJkiRJ0uBZwJAkSZIkSYNnAUOSJEmSJA2eBQxJkiRJkjR4FjAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNngUMSZIkSZI0eBYwJEmSJEnS4FnAkCRJkiRJg2cBQ5IkSZIkDZ4FDEmSNDGSLElyU5JLR2LvSvLNJBcn+WSSjVt82yQ/THJRe/3jyDY7JbkkycokxyVJi2+a5OwkV7X3TVo8rd3Kdpxnj/mrS5K03rOAIUmSJslJwOJpsbOBp1XVM4D/At48su5bVbVje/3xSPx44HXA9u01tc/DgXOqanvgnPYZYO+Rtoe07SVJ0hhZwJAkSROjqs4Fbp0W+2xV3d0+ngdsNdc+kmwJPLaqzquqAk4BXtpW7wuc3JZPnhY/pTrnARu3/UiSpDGZ1wLGLN08j0xy3Uh3zn1G1r25dc28MsleI/HFLbYyyeEj8e2SfK3FP57kYS3+8PZ5ZVu/7Xx+T0mSNBh/AHxm5PN2Sb6e5ItJfqvFngisGmmzqsUAtqiq69vyDcAWI9tcO8s2kiRpDOa7B8ZJ3L+bJ8B7RrpzngWQZAdgf+DX2jb/kGSDJBsAH6DrurkD8KrWFuCYtq9fBm4DDm7xg4HbWvw9rZ0kSVrAkrwVuBv4SAtdD2xTVc8C/gL4aJLHrun+Wu+MWsscDkmyIsmKm2++eW02lSRJqzGvBYyZunnOYV/g1Kr6cVV9G1gJ7NxeK6vq6qr6CXAqsG8bbOv5wBlt++ndPKe6f54B7DE1OJckSVp4krwWeBHwe63wQLumuKUtXwB8C3gKcB33fcxkqxYDuHHq0ZD2flOLXwdsPcs2P1NVJ1TVoqpatPnmm6+jbydJkqC/MTAOayN4L5ka3ZvZu2bOFn888L2RZ15Hu3L+bJu2/vbW/n68UyJJ0mRLshj4P8BLququkfjmrScnSZ5ENwDn1e0RkTuS7NpucBwAnNk2Wwoc2JYPnBY/oM1Gsitw+8ijJpIkaQz6KGAcDzwZ2JGua+exPeTwM94pkSRpciT5GPBV4FeSrEpyMPB+4DHA2dOmS30OcHGSi+h6ZP5xVU31DH09q0w9FwAAHXFJREFU8CG6Hp/f4ufjZhwNvDDJVcAL2meAs4CrW/sPtu0lSdIYbTjuA1bVjVPLST4IfLp9nKtr5kzxW+hGAN+w9bIYbT+1r1VJNgQe19pLkqQJVlWvmiF84ixt/wX4l1nWrQCeNkP8FmCPGeIFHLpWyUqSpHVq7D0wpk059jJgaoaSpcD+bQaR7ei6eZ4PLAe2bzOOPIxuoM+l7ULi88B+bfvp3Tynun/uB3xu6nlYSZIkSZI0eea1B0br5rk7sFmSVcARwO5JdqQb1fs7wB8BVNVlSU4DLqcbQfzQqrqn7ecwYBmwAbCkqi5rh/gr4NQkfwN8nZ/fgTkR+OckK+kGEd1/Pr+nJEmSJEmaX/NawFibbp6t/VHAUTPEz6J79nR6/Gq6WUqmx38EvGKtkpUkSZIkSYPV1ywk
kiRJkiRJa8wChiRJkiRJGjwLGJIkSZIkafAsYEiSJEmSpMGzgCFJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEkauyR/l+SxSR6a5JwkNyf5/b7zkiRJw7Vh3wlIkqT10p5V9X+SvAz4DvBy4Fzgw71mpQVnt/ft1ncK9/PlP/ly3ylI0kSyB4YkSerDQ9v7bwOnV9XtfSYjSZKGzx4YkiSpD59K8k3gh8D/TrI58KOec5IkSQNmDwxJktSHI4DfABZV1U+Bu4CXrG6jJEuS3JTk0pHYpknOTnJVe9+kxZPkuCQrk1yc5Nkj2xzY2l+V5MCR+E5JLmnbHJckcx1DkiSNjwUMSZLUh69W1a1VdQ9AVf0A+MwabHcSsHha7HDgnKraHjinfQbYG9i+vQ4BjoeuGEFXQNkF2Bk4YqQgcTzwupHtFq/mGJIkaUwsYEiSpLFJ8otJdgI2SvKsJM9ur92BR65u+6o6F7h1Wnhf4OS2fDLw0pH4KdU5D9g4yZbAXsDZrYByG3A2sLite2xVnVdVBZwybV8zHUOSJI2JY2BIkqRx2gt4LbAV8O6R+PeBtzzAfW5RVde35RuALdryE4FrR9qtarG54qtmiM91jPtIcghdbw+22WabB/JdJEnSLCxgSJKksamqk4GTk/xOVf3LPOy/ktS63u+aHqOqTgBOAFi0aNG85iFJ0vrGAoYkSerDp5O8GtiWkeuRqnrHA9jXjUm2rKrr22MgN7X4dcDWI+22arHrgN2nxb/Q4lvN0H6uY0iSpDFxDAxJktSHM+nGlbgb+MHI64FYCkzNJHJg2/dU/IA2G8muwO3tMZBlwJ5JNmmDd+4JLGvr7kiya5t95IBp+5rpGJIkaUzsgaFB2e19u/Wdwv18+U++3HcKkrQQbVVV02cTWa0kH6PrPbFZklV0s4kcDZyW5GDgGuCVrflZwD7ASrppWg8CqKpbk7wTWN7avaOqpgYGfT3dTCcb0c2KMjUzymzHkCRJY7LGBYwkD6+qH0+LbTpywpckSVpTX0ny9Kq6ZG02qqpXzbJqjxnaFnDoLPtZAiyZIb4CeNoM8VtmOoYkSRqftXmE5BNJHjr1oT3/efa6T0mSJK0HfhO4IMmVSS5OckmSi/tOSpIkDdfaPELyr3RdJ/ejGxBrKfCX85GUJEla8PbuOwFJkjRZ1riAUVUfTPIwukLGtsAfVdVX5ikvSZK0sG0JXFZV3wdI8ljgqXTjS0iSJN3PagsYSf5i9COwDXARsGuSXavq3fOUmyRJWriOB5498vnOGWKSJEk/syY9MB4z7fMnZolLkiStqbRBNgGoqnuTODuaJEma1WovFKrq7eNIRJIkrVeuTvKndL0uoJu+9Ooe85EkSQO3NtOoPoVu0M5tR7erquev+7QkSdIC98fAccDbgALOAQ7pNSNJkjRoa9NV83TgH4EPAffMTzqSJGl9UFU3AfvPtj7Jm6vq/44xJUmSNHBrU8C4u6qOX30zSZKkB+0VgAUMSZL0Mw9Zi7afSvL6JFsm2XTqNdcGSZYkuSnJpSOxTZOcneSq9r5JiyfJcUlWJrk4ybNHtjmwtb8qyYEj8Z2SXNK2OS5J5jqGJEmaGOk7AUmSNCxrU8A4EHgT8BXggvZasZptTgIWT4sdDpxTVdvTPe96eIvvDWzfXofQBvVqRZIjgF2AnYEjRgoSxwOvG9lu8WqOIUmSJkOtvokkSVqfrHEBo6q2m+H1pNVscy5w67TwvsDJbflk4KUj8VOqcx6wcZItgb2As6vq1qq6DTgbWNzWPbaqzmvTsJ0ybV8zHUOSJE0Ge2BIkqT7WKv51pM8DdgBeMRUrKpOWctjblFV17flG4At2vITgWtH2q1qsbniq2aIz3WM+0lyCG3U82222WYtv4r0c198znP7TuF+nnvuF/tOQZJmlGTTqrp1Wmy7qvp2+3h6D2lJkqQBW+MeGEmOAN7XXs8D/g54yYM5
eOs5Ma9dRFd3jKo6oaoWVdWizTfffD5TkSRJP/epJI+d+pBkB+BTU5+r6m97yUqSJA3W2oyBsR+wB3BDVR0EPBN43AM45o3t8Q/a+00tfh2w9Ui7rVpsrvhWM8TnOoYkSRqGv6UrYjw6yU50PS5+v+ecJEnSgK1NAeNHVXUvcHe7Y3IT9y0srKmldAOC0t7PHIkf0GYj2RW4vT0GsgzYM8kmbfDOPYFlbd0dSXZts48cMG1fMx1DkiQNQFX9G/Ae4LN0g36/rKou6jMnSZI0bGs0BkYrEFycZGPgg3QzkNwJfHU1230M2B3YLMkqutlEjgZOS3IwcA3wytb8LGAfYCVwF3AQQFXdmuSdwPLW7h0jz8y+nu6iZyPgM+3FHMeQJEk9SvI+7vto5+OAbwGHJaGq/vQB7vdXgI+PhJ4E/DWwMd2MZTe3+Fuq6qy2zZuBg4F7gD+tqmUtvhh4L7AB8KGqOrrFtwNOBR5Pdy30mqr6yQPJV5Ikrb01KmBUVSXZuaq+B/xjkn+nmwHk4tVs96pZVu0x0zGAQ2fZzxJgyQzxFcDTZojfMtMxJElS76ZPwX7ButhpVV0J7AiQZAO6x0o/SXdD5D1V9fej7duYG/sDvwY8AfiPJE9pqz8AvJBugPDlSZZW1eXAMW1fpyb5R7rix/HrIn9JkrR6azMLyYVJfr2qllfVd+YrIUmStHBV1ckASR5F93jqPe3zBsDD19Fh9gC+VVXXdJ1IZ7QvcGpV/Rj4dpKVwM5t3cqqurrldSqwb5IrgOcDr25tTgaOxAKGJEljszYFjF2A30tyDfADuvnZq6qeMS+ZSZKkhewc4AV0j6RC9zjoZ4HfWAf73h/42Mjnw5IcQNf7441VdRvd1OvnjbQZnY59+vTtu9A9NvK9qrp7hvY/4/TsErz/jZ9afaMxO+zYF/edgqR1YG0G8dwLeDLd3YcXAy9q75IkSWvrEVU1VbygLT/ywe40ycPopnk/vYWOp7t+2RG4Hjj2wR5jLk7PLknS/FnjHhhVdc18JiJJktYrP0jy7Kq6EKBNpfrDdbDfvYELq+pGgKn3dowPAp9uH2ebpp1Z4rcAGyfZsPXCGG0vSZLGYG16YEiSJK0rfwacnuQ/k3yJbgaRw9bBfl/FyOMjSbYcWfcy4NK2vBTYP8nD2+wi2wPn0816tn2S7Vpvjv2BpW2w8c8D+7XtnaZdkqQxW5sxMCRJktaJqlqe5FeBX2mhK6vqpw9mn21g0BcCfzQS/rskO9JN3fqdqXVVdVmS04DLgbuBQ0cGFD0MWEY3jeqSqrqs7euvgFOT/A3wdeDEB5OvJElaOxYwJEnS2CR5flV9LsnLp616ShKq6hMPdN9V9QO6wTZHY6+Zo/1RwFEzxM8CzpohfjU/n6lEkiSNmQUMSZI0Ts8FPsfMA4EX8IALGJIkaWGzgCFJksamqo5o7wf1nYskSZosFjAkSdLYJPmLudZX1bvHlYskSZosFjAkSdI4PWaOdTW2LCRJ0sSxgCFJksamqt4OkORk4A1V9b32eRPg2B5TkyRJA/eQvhOQJEnrpWdMFS8Aquo24Fn9pSNJkobOAoYkSerDQ1qvCwCSbIo9QyVJ0hy8UJAkSX04FvhqktPb51cAR/WYjzQoX3zOc/tO4X6ee+4X+05B0nrOAoYkSRq7qjolyQrg+S308qq6vM+cJEnSsFnAkNZz73/jp/pO4X4OO/bFfacgaQxawcKihSRJWiOOgSFJkiRJkgbPAoYkSZIkSRo8HyGRJEmStN476vf36zuF+3nrh8/oOwVpUCxgSJIkabW++46n953C/Wzz15f0nYI0CFcc9bm+U7ifp771+atvJK0lHyGRJEmSJEmDZwFDkiQtCEm+k+SSJBe1KVpJsmmSs5Nc1d43afEkOS7JyiQXJ3n2yH4ObO2vSnLgSHyntv+VbduM/1tKkrT+soAhSZIWkudV1Y5Vtah9Phw4p6q2B85pnwH2BrZvr0OA46EreABHALsA
OwNHTBU9WpvXjWy3eP6/jiRJmmIBQ5IkLWT7Aie35ZOBl47ET6nOecDGSbYE9gLOrqpbq+o24GxgcVv32Ko6r6oKOGVkX5IkaQwsYEiSpIWigM8muSDJIS22RVVd35ZvALZoy08Erh3ZdlWLzRVfNUNckiSNibOQSJpYTncmaZrfrKrrkvwCcHaSb46urKpKUvOZQCucHAKwzTbbzOehJEla79gDQ5IkLQhVdV17vwn4JN0YFje2xz9o7ze15tcBW49svlWLzRXfaob49BxOqKpFVbVo8803XxdfS5IkNRYwJEnSxEvyqCSPmVoG9gQuBZYCUzOJHAic2ZaXAge02Uh2BW5vj5osA/ZMskkbvHNPYFlbd0eSXdvsIweM7EuSJI2Bj5BIkqSFYAvgk21m0w2Bj1bVvydZDpyW5GDgGuCVrf1ZwD7ASuAu4CCAqro1yTuB5a3dO6rq1rb8euAkYCPgM+0lSXqAjjzyyL5TuJ8h5qSfs4AhSZImXlVdDTxzhvgtwB4zxAs4dJZ9LQGWzBBfATztQScrSZIeEB8hkSRJkiRJg9dbD4wk3wG+D9wD3F1Vi5JsCnwc2Bb4DvDKqrqtPWv6XrqunncBr62qC9t+DgTe1nb7N1V1covvxM+7eZ4FvKHdbZEkSZIk6QE57fSd+05hRq98xfl9pzDv+u6B8byq2rGqFrXPhwPnVNX2wDntM8DewPbtdQhwPEAreBwB7EI30vgRbcAtWpvXjWy3eP6/jiRJkiRJmg99FzCm2xc4uS2fDLx0JH5Kdc4DNm5Toe0FnF1Vt1bVbcDZwOK27rFVdV7rdXHKyL4kSZIkSdKE6bOAUcBnk1yQ5JAW26JNUwZwA92I4gBPBK4d2XZVi80VXzVD/H6SHJJkRZIVN99884P5PpIkSZIkaZ70OQvJb1bVdUl+ATg7yTdHV1ZVJZn3MSuq6gTgBIBFixY5RoakeXfFUZ/rO4X7eepbn993CpIkSdKceuuBUVXXtfebgE/SjWFxY3v8g/Z+U2t+HbD1yOZbtdhc8a1miEuSJEmSpAnUSwEjyaOSPGZqGdgTuBRYChzYmh0InNmWlwIHpLMrcHt71GQZsGeSTdrgnXsCy9q6O5Ls2mYwOWBkX5IkSZIkacL09QjJFsAnu9oCGwIfrap/T7IcOC3JwcA1wCtb+7PoplBdSTeN6kEAVXVrkncCy1u7d1TVrW359fx8GtXPtJckSZIkSeulZ56xrO8U7ucb++21xm17KWBU1dXAM2eI3wLsMUO8gENn2dcSYMkM8RXA0x50spIkSZIkqXdDm0ZVkiRJkiTpfvqchUSSNEGOPPLIvlOY0VDzkiRJ0rplDwxJkiRJkjR4FjAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNngUMSZI08ZJsneTzSS5PclmSN7T4kUmuS3JRe+0zss2bk6xMcmWSvUbii1tsZZLDR+LbJflai388ycPG+y0lSVq/OQuJJGnBO+30nftO4X5e+Yrz+05hobkbeGNVXZjkMcAFSc5u695TVX8/2jjJDsD+wK8BTwD+I8lT2uoPAC8EVgHLkyytqsuBY9q+Tk3yj8DBwPHz/s0kSRJgDwxJkrQAVNX1VXVhW/4+cAXwxDk22Rc4tap+XFXfBlYCO7fXyqq6uqp+ApwK7JskwPOBM9r2JwMvnZcvI0mSZmQBQ5IkLShJtgWeBXythQ5LcnGSJUk2abEnAteObLaqxWaLPx74XlXdPS0+/diHJFmRZMXNN9+8rr6SJEnCAoYkSVpAkjwa+Bfgz6rqDrpHPJ4M7AhcDxw7n8evqhOqalFVLdp8883n81CSJK13HANDkqSBeuYZy/pO4X6+sd9eq2/UkyQPpStefKSqPgFQVTeOrP8g8On28Tpg65HNt2oxZonfAmycZMPWC2O0vSRJGgN7YEiSpInXxqg4Ebiiqt49Et9ypNnLgEvb8lJg/yQPT7IdsD1wPrAc2L7NOPIwuoE+l1ZVAZ8H
9mvbHwicOZ/fSZIk3Zc9MCRJ0kKwG/Aa4JIkF7XYW4BXJdkRKOA7wB8BVNVlSU4DLqebweTQqroHIMlhwDJgA2BJVV3W9vdXwKlJ/gb4Ol3BRJIkjYkFDEmSNPGq6ktAZlh11hzbHAUcNUP8rJm2q6qr6WYpkSRJPfAREkmSJEmSNHgWMCRJkiRJ0uBZwJAkSZIkSYNnAUOSJEmSJA2eBQxJkiRJkjR4FjAkSZIkSdLgWcCQJEmSJEmDZwFDkiRJkiQNngUMSZIkSZI0eBYwJEmSJEnS4FnAkCRJkiRJg2cBQ5IkSZIkDZ4FDEmSJEmSNHgWMCRJkiRJ0uBZwJAkSZIkSYNnAUOSJEmSJA3egi5gJFmc5MokK5Mc3nc+kiRpsnltIUlSfxZsASPJBsAHgL2BHYBXJdmh36wkSdKk8tpCkqR+LdgCBrAzsLKqrq6qnwCnAvv2nJMkSZpcXltIktSjVFXfOcyLJPsBi6vqD9vn1wC7VNVh09odAhzSPv4KcOU8pbQZ8D/ztO/5NKl5w+TmPql5w+TmPql5w+Tmbt7jN5+5/1JVbT5P+x6MNbm2GON1BUzuv0fzHr9JzX1S84bJzX1S84bJzd28ZzbjtcWG83jAiVBVJwAnzPdxkqyoqkXzfZx1bVLzhsnNfVLzhsnNfVLzhsnN3bzHb5JznyTjuq6Ayf1vat7jN6m5T2reMLm5T2reMLm5m/faWciPkFwHbD3yeasWkyRJeiC8tpAkqUcLuYCxHNg+yXZJHgbsDyztOSdJkjS5vLaQJKlHC/YRkqq6O8lhwDJgA2BJVV3WY0pj6U46DyY1b5jc3Cc1b5jc3Cc1b5jc3M17/CY590Hw2mKdMe/xm9TcJzVvmNzcJzVvmNzczXstLNhBPCVJkiRJ0sKxkB8hkSRJkiRJC4QFDEmSJEmSNHgWMCRJkiRJ0uBZwNB9JPnVJHskefS0+OK+clpTSXZO8utteYckf5Fkn77zWltJTuk7hwciyW+2n/mefecylyS7JHlsW94oyduTfCrJMUke13d+s0nyp0m2Xn3L4UnysCQHJHlB+/zqJO9PcmiSh/ad31ySPCnJXyZ5b5J3J/njqX8/0jh5fu6f5+f5NannZ5jcc7TnZ00iB/EcsyQHVdU/9Z3HTJL8KXAocAWwI/CGqjqzrbuwqp7dY3pzSnIEsDfdzDpnA7sAnwdeCCyrqqN6TG9WSaZPvxfgecDnAKrqJWNPag0lOb+qdm7Lr6P7t/NJYE/gU1V1dJ/5zSbJZcAz22wCJwB3AWcAe7T4y3tNcBZJbgd+AHwL+BhwelXd3G9WaybJR+j+33wk8D3g0cAn6H7mqaoD+8tudu134ouAc4F9gK/T5f8y4PVV9YXektN6xfPz+Hl+Hr9JPT/D5J6jPT9rElnAGLMk362qbfrOYyZJLgH+V1XdmWRbupPGP1fVe5N8vaqe1W+Gs2u57wg8HLgB2Kqq7kiyEfC1qnpGn/nNJsmFwOXAh4Ciu0D6GLA/QFV9sb/s5jb6byLJcmCfqro5yaOA86rq6f1mOLMkV1TVU9vyfS78k1xUVTv2ltwcknwd2Al4AfC7wEuAC+j+vXyiqr7fY3pzSnJxVT0jyYbAdcATquqeJAG+MeD/Py8Bdmy5PhI4q6p2T7INcObAfyc+Dngz8FLgF+h+v9wEnAkcXVXf6y05rTXPz+Pn+Xn8JvX8DJN7jvb8PH6enx88HyGZB0kunuV1CbBF3/nN4SFVdSdAVX0H2B3YO8m76U7cQ3Z3Vd1TVXcB36qqOwCq6ofAvf2mNqdFdCe4twK3t4rxD6vqi0O+OGoekmSTJI+nK4beDFBVPwDu7je1OV2a5KC2/I0kiwCSPAX4aX9prVZV1b1V9dmqOhh4AvAPwGLg6n5TW62HJHkY8Bi6uzxTXYEfDgy6iyrdnSnocn00QFV9l+HnfRpwG7B7VW1aVY+nu3t8W1unyeL5efw8P4/fpJ6fYXLP0Z6fx29Bnp+TfGZc
x9pw9U30AGwB7EX3D3FUgK+MP501dmOSHavqIoB2p+dFwBJgkNX6ET9J8sh2gbTTVLBVOQd7gVRV9wLvSXJ6e7+Ryfn/8nF0F3cBKsmWVXV9uuezh3xB/YfAe5O8Dfgf4KtJrgWubeuG6j4/06r6KbAUWNruPgzZicA3gQ3o/hg4PcnVwK7AqX0mthofApYn+RrwW8AxAEk2B27tM7E1sG1VHTMaqKobgGOS/EFPOemB8/w8Zp6fezGp52eY3HO05+fxm9jzc5LZHlcMXU+78eThIyTrXpITgX+qqi/NsO6jVfXqHtJarSRb0d0puWGGdbtV1Zd7SGuNJHl4Vf14hvhmwJZVdUkPaa21JL8N7FZVb+k7lweqnai3qKpv953LXNIN9LQd3QXpqqq6seeU5pTkKVX1X33n8UAleQJAVf13ko3putl+t6rO7zWx1Ujya8BTgUur6pt957OmknwW+A/g5Kl/20m2AF4LvLCqXtBjelpLnp/75/l5fCbt/AyTfY72/Dxek3x+TnIP8EVmLoTuWlUbjSUPCxiSJC0sSTYBDgf2pXvGFuBGujuCR1fV9B6CkiRpnk3y+TnJpcDLquqqGdZdW1VjmYnHAoYkSeuRDHg2LEmS1ldDPz8n2Q+4pKqunGHdS6vqX8eShwUMSZLWHxnwbFiSJK2vJvn8PM7iiwUMSZIWmCQXz7YKeEpVPXyc+UiSpIV7fh5n8WVSRlOWJElrblJnw5IkaSGb2PPzaoovW4wrDwsYkiQtPJ8GHj017eaoJF8YezaSJAkm+/w8iOKLj5BIkiRJkqRZJTkR+Keq+tIM6z5aVa8eSx4WMCRJkiRJ0tA9pO8EJEmSJEmSVscChqSxS7JWz8kl2T3Jp+crH0mSJEnDZwFD0thV1W/0nYMkSVo4vDkirR8sYEgauyR3tvfdk3whyRlJvpnkI0nS1i1usQuBl49s+6gkS5Kcn+TrSfZt8fcm+eu2vFeSc5P4O06SpPWAN0ek9YMX95L69izgz4AdgCcBuyV5BPBB4MXATsAvjrR/K/C5qtoZeB7wriSPAt4M/G6S5wHHAQdV1b1j+xaSJKk33hyR1g/+Dyipb+dX1apWbLgI2Bb4VeDbVXVVdVMlfXik/Z7A4UkuAr4APALYpqruAl4HnA28v6q+NbZvIEmShsSbI9ICtWHfCUha7/14ZPkeVv97KcDvVNWVM6x7OnAL8IR1lJskSZo851fVKoB2w2Nb4E7azZEW/zBwSGu/J/CSJH/ZPk/dHLkiyeuAc4E/9+aI1D97YEgaom8C2yZ5cvv8qpF1y4A/GekO+qz2/kvAG+nuuuydZJcx5itJkobjgd4c2bG9tqmqK9o6b45IA2IBQ9LgVNWP6O6K/Ft7TvWmkdXvBB4KXJzkMuCdrZhxIvCXVfXfwMHAh1p3UUmSJG+OSAtAusfLJUmSJGkyJbmzqh6dZHe6GxovavH3Ayuq6qQki4H/D7gL+E/gyVX1oiQbtfhv0N3g/TbdWBlnA8dV1dIkOwEnAb/ebrRI6oEFDEmSJEmSNHg+QiJJkiRJkgbPAoYkSZIkSRo8CxiSJEmSJGnwLGBIkiRJkqTBs4AhSZIkSZIGzwKGJEmSJEkaPAsYkiRJkiRp8P5/amPPP886sqQAAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " click_article_id category_id created_at_ts words_count\n",
- "0 0 0 1513144419000 168\n",
- "1 1 1 1405341936000 189\n",
- "2 2 1 1408667706000 250\n",
- "3 3 1 1408468313000 230\n",
- "4 4 1 1407071171000 162\n",
- "364042 364042 460 1434034118000 144\n",
- "364043 364043 460 1434148472000 463\n",
- "364044 364044 460 1457974279000 177\n",
- "364045 364045 460 1515964737000 126\n",
- "364046 364046 460 1505811330000 479"
- ]
- },
- "execution_count": 20,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Browse the news article dataset\n",
- "item_df.head().append(item_df.tail())"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:28:13.084501Z",
- "start_time": "2020-11-13T15:28:13.062561Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "176 3485\n",
- "182 3480\n",
- "179 3463\n",
- "178 3458\n",
- "174 3456\n",
- "183 3432\n",
- "184 3427\n",
- "173 3414\n",
- "180 3403\n",
- "177 3391\n",
- "170 3387\n",
- "187 3355\n",
- "169 3352\n",
- "185 3348\n",
- "175 3346\n",
- "181 3330\n",
- "186 3328\n",
- "189 3327\n",
- "171 3327\n",
- "172 3322\n",
- "165 3308\n",
- "188 3288\n",
- "167 3269\n",
- "190 3261\n",
- "192 3257\n",
- "168 3248\n",
- "193 3225\n",
- "166 3199\n",
- "191 3182\n",
- "194 3164\n",
- " ... \n",
- "601 1\n",
- "857 1\n",
- "1977 1\n",
- "1626 1\n",
- "697 1\n",
- "1720 1\n",
- "696 1\n",
- "706 1\n",
- "592 1\n",
- "1605 1\n",
- "586 1\n",
- "582 1\n",
- "1606 1\n",
- "972 1\n",
- "716 1\n",
- "584 1\n",
- "1608 1\n",
- "715 1\n",
- "841 1\n",
- "968 1\n",
- "964 1\n",
- "587 1\n",
- "1099 1\n",
- "1355 1\n",
- "711 1\n",
- "845 1\n",
- "710 1\n",
- "965 1\n",
- "847 1\n",
- "1535 1\n",
- "Name: words_count, Length: 866, dtype: int64"
- ]
- },
- "execution_count": 21,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_df['words_count'].value_counts()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:28:59.029535Z",
- "start_time": "2020-11-13T15:28:58.816106Z"
- }
- },
- "outputs": [
+ "source": [
+ "# Plot the 10 most frequent values of each click-log column in a 5x2 grid\n",
+ "plt.figure(figsize=(15, 20))\n",
+ "i = 1\n",
+ "for col in ['click_article_id', 'click_timestamp', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country',\n",
+ "            'click_region', 'click_referrer_type', 'rank', 'click_cnts']:\n",
+ "    plt.subplot(5, 2, i)\n",
+ "    i += 1\n",
+ "    # value_counts() sorts by frequency, so head(10) keeps the top 10 values\n",
+ "    v = trn_click[col].value_counts().head(10)\n",
+ "    ax = sns.barplot(x=v.index, y=v.values)\n",
+ "    for item in ax.get_xticklabels():\n",
+ "        item.set_rotation(90)\n",
+ "    plt.title(col)\n",
+ "plt.tight_layout()\n",
+ "plt.show()"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "461\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
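+ "Note: the click_cnts bar chart here shows, for each article click, the cumulative number of clicks made by the corresponding user.\n",
+ "\n",
+ "The data can also be analysed from the user's perspective, by plotting a histogram of the number of articles each user clicked."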
+ ]
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "4 1084627\n",
+ "2 25894\n",
+ "1 2102\n",
+ "Name: click_environment, dtype: int64"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_click['click_environment'].value_counts()"
]
- },
- "execution_count": 22,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at click_environment, only 2,102 clicks (0.19%) have environment 1 and only 25,894 (2.3%) have environment 2; the remaining 1,084,627 (97.5%) have environment 4."
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "print(item_df['category_id'].nunique())  # 461 article categories\n",
- "item_df['category_id'].hist()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(364047, 4)"
- ]
- },
- "execution_count": 23,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_df.shape  # 364047 articles"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### News article embedding vectors"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " article_id \n",
- " emb_0 \n",
- " emb_1 \n",
- " emb_2 \n",
- " emb_3 \n",
- " emb_4 \n",
- " emb_5 \n",
- " emb_6 \n",
- " emb_7 \n",
- " emb_8 \n",
- " ... \n",
- " emb_240 \n",
- " emb_241 \n",
- " emb_242 \n",
- " emb_243 \n",
- " emb_244 \n",
- " emb_245 \n",
- " emb_246 \n",
- " emb_247 \n",
- " emb_248 \n",
- " emb_249 \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " -0.161183 \n",
- " -0.957233 \n",
- " -0.137944 \n",
- " 0.050855 \n",
- " 0.830055 \n",
- " 0.901365 \n",
- " -0.335148 \n",
- " -0.559561 \n",
- " -0.500603 \n",
- " ... \n",
- " 0.321248 \n",
- " 0.313999 \n",
- " 0.636412 \n",
- " 0.169179 \n",
- " 0.540524 \n",
- " -0.813182 \n",
- " 0.286870 \n",
- " -0.231686 \n",
- " 0.597416 \n",
- " 0.409623 \n",
- " \n",
- " \n",
- " 1 \n",
- " 1 \n",
- " -0.523216 \n",
- " -0.974058 \n",
- " 0.738608 \n",
- " 0.155234 \n",
- " 0.626294 \n",
- " 0.485297 \n",
- " -0.715657 \n",
- " -0.897996 \n",
- " -0.359747 \n",
- " ... \n",
- " -0.487843 \n",
- " 0.823124 \n",
- " 0.412688 \n",
- " -0.338654 \n",
- " 0.320786 \n",
- " 0.588643 \n",
- " -0.594137 \n",
- " 0.182828 \n",
- " 0.397090 \n",
- " -0.834364 \n",
- " \n",
- " \n",
- " 2 \n",
- " 2 \n",
- " -0.619619 \n",
- " -0.972960 \n",
- " -0.207360 \n",
- " -0.128861 \n",
- " 0.044748 \n",
- " -0.387535 \n",
- " -0.730477 \n",
- " -0.066126 \n",
- " -0.754899 \n",
- " ... \n",
- " 0.454756 \n",
- " 0.473184 \n",
- " 0.377866 \n",
- " -0.863887 \n",
- " -0.383365 \n",
- " 0.137721 \n",
- " -0.810877 \n",
- " -0.447580 \n",
- " 0.805932 \n",
- " -0.285284 \n",
- " \n",
- " \n",
- " 3 \n",
- " 3 \n",
- " -0.740843 \n",
- " -0.975749 \n",
- " 0.391698 \n",
- " 0.641738 \n",
- " -0.268645 \n",
- " 0.191745 \n",
- " -0.825593 \n",
- " -0.710591 \n",
- " -0.040099 \n",
- " ... \n",
- " 0.271535 \n",
- " 0.036040 \n",
- " 0.480029 \n",
- " -0.763173 \n",
- " 0.022627 \n",
- " 0.565165 \n",
- " -0.910286 \n",
- " -0.537838 \n",
- " 0.243541 \n",
- " -0.885329 \n",
- " \n",
- " \n",
- " 4 \n",
- " 4 \n",
- " -0.279052 \n",
- " -0.972315 \n",
- " 0.685374 \n",
- " 0.113056 \n",
- " 0.238315 \n",
- " 0.271913 \n",
- " -0.568816 \n",
- " 0.341194 \n",
- " -0.600554 \n",
- " ... \n",
- " 0.238286 \n",
- " 0.809268 \n",
- " 0.427521 \n",
- " -0.615932 \n",
- " -0.503697 \n",
- " 0.614450 \n",
- " -0.917760 \n",
- " -0.424061 \n",
- " 0.185484 \n",
- " -0.580292 \n",
- " \n",
- " \n",
- "
\n",
- "
5 rows × 251 columns
\n",
- "
"
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1 678187\n",
+ "3 395558\n",
+ "4 38731\n",
+ "5 141\n",
+ "2 6\n",
+ "Name: click_deviceGroup, dtype: int64"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " article_id emb_0 emb_1 emb_2 emb_3 emb_4 emb_5 \\\n",
- "0 0 -0.161183 -0.957233 -0.137944 0.050855 0.830055 0.901365 \n",
- "1 1 -0.523216 -0.974058 0.738608 0.155234 0.626294 0.485297 \n",
- "2 2 -0.619619 -0.972960 -0.207360 -0.128861 0.044748 -0.387535 \n",
- "3 3 -0.740843 -0.975749 0.391698 0.641738 -0.268645 0.191745 \n",
- "4 4 -0.279052 -0.972315 0.685374 0.113056 0.238315 0.271913 \n",
- "\n",
- " emb_6 emb_7 emb_8 ... emb_240 emb_241 emb_242 \\\n",
- "0 -0.335148 -0.559561 -0.500603 ... 0.321248 0.313999 0.636412 \n",
- "1 -0.715657 -0.897996 -0.359747 ... -0.487843 0.823124 0.412688 \n",
- "2 -0.730477 -0.066126 -0.754899 ... 0.454756 0.473184 0.377866 \n",
- "3 -0.825593 -0.710591 -0.040099 ... 0.271535 0.036040 0.480029 \n",
- "4 -0.568816 0.341194 -0.600554 ... 0.238286 0.809268 0.427521 \n",
- "\n",
- " emb_243 emb_244 emb_245 emb_246 emb_247 emb_248 emb_249 \n",
- "0 0.169179 0.540524 -0.813182 0.286870 -0.231686 0.597416 0.409623 \n",
- "1 -0.338654 0.320786 0.588643 -0.594137 0.182828 0.397090 -0.834364 \n",
- "2 -0.863887 -0.383365 0.137721 -0.810877 -0.447580 0.805932 -0.285284 \n",
- "3 -0.763173 0.022627 0.565165 -0.910286 -0.537838 0.243541 -0.885329 \n",
- "4 -0.615932 -0.503697 0.614450 -0.917760 -0.424061 0.185484 -0.580292 \n",
- "\n",
- "[5 rows x 251 columns]"
- ]
- },
- "execution_count": 24,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_emb_df.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "(295141, 251)"
- ]
- },
- "execution_count": 25,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "item_emb_df.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Data analysis"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Repeated clicks by users"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:30:20.899771Z",
- "start_time": "2020-11-13T15:30:20.750817Z"
- }
- },
- "outputs": [],
- "source": [
- "#####merge\n",
- "user_click_merge = trn_click.append(tst_click)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:30:26.290038Z",
- "start_time": "2020-11-13T15:30:25.339579Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 0 \n",
- " 0 \n",
- " 30760 \n",
- " 1 \n",
- " \n",
- " \n",
- " 1 \n",
- " 0 \n",
- " 157507 \n",
- " 1 \n",
- " \n",
- " \n",
- " 2 \n",
- " 1 \n",
- " 63746 \n",
- " 1 \n",
- " \n",
- " \n",
- " 3 \n",
- " 1 \n",
- " 289197 \n",
- " 1 \n",
- " \n",
- " \n",
- " 4 \n",
- " 2 \n",
- " 36162 \n",
- " 1 \n",
- " \n",
- " \n",
- " 5 \n",
- " 2 \n",
- " 168401 \n",
- " 1 \n",
- " \n",
- " \n",
- " 6 \n",
- " 3 \n",
- " 36162 \n",
- " 1 \n",
- " \n",
- " \n",
- " 7 \n",
- " 3 \n",
- " 50644 \n",
- " 1 \n",
- " \n",
- " \n",
- " 8 \n",
- " 4 \n",
- " 39894 \n",
- " 1 \n",
- " \n",
- " \n",
- " 9 \n",
- " 4 \n",
- " 42567 \n",
- " 1 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "trn_click['click_deviceGroup'].value_counts()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Looking at click_deviceGroup, device group 1 accounts for the majority of clicks (61%), and device group 3 for 36%."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Test-set user click log"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "0 249999 160974 1506959142820 4 \n",
+ "1 249999 160417 1506959172820 4 \n",
+ "2 249998 160974 1506959056066 4 \n",
+ "3 249998 202557 1506959086066 4 \n",
+ "4 249997 183665 1506959088613 4 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "0 1 17 1 13 \n",
+ "1 1 17 1 13 \n",
+ "2 1 12 1 13 \n",
+ "3 1 12 1 13 \n",
+ "4 1 17 1 15 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
+ "0 2 19 19 281 1506912747000 \n",
+ "1 2 18 19 281 1506942089000 \n",
+ "2 2 5 5 281 1506912747000 \n",
+ "3 2 4 5 327 1506938401000 \n",
+ "4 5 7 7 301 1500895686000 \n",
+ "\n",
+ " words_count \n",
+ "0 259 \n",
+ "1 173 \n",
+ "2 259 \n",
+ "3 219 \n",
+ "4 256 "
+ ]
+ },
+ "execution_count": 16,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id count\n",
- "0 0 30760 1\n",
- "1 0 157507 1\n",
- "2 1 63746 1\n",
- "3 1 289197 1\n",
- "4 2 36162 1\n",
- "5 2 168401 1\n",
- "6 3 36162 1\n",
- "7 3 50644 1\n",
- "8 4 39894 1\n",
- "9 4 42567 1"
- ]
- },
- "execution_count": 27,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Repeated clicks by users\n",
- "user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg({'count'}).reset_index()\n",
- "user_click_count[:10]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:34:27.418638Z",
- "start_time": "2020-11-13T15:34:27.372761Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 311242 \n",
- " 86295 \n",
- " 74254 \n",
- " 10 \n",
- " \n",
- " \n",
- " 311243 \n",
- " 86295 \n",
- " 76268 \n",
- " 10 \n",
- " \n",
- " \n",
- " 393761 \n",
- " 103237 \n",
- " 205948 \n",
- " 10 \n",
- " \n",
- " \n",
- " 393763 \n",
- " 103237 \n",
- " 235689 \n",
- " 10 \n",
- " \n",
- " \n",
- " 576902 \n",
- " 134850 \n",
- " 69463 \n",
- " 13 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])\n",
+ "tst_click.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "count 518010.000000 518010.000000 5.180100e+05 518010.000000 \n",
+ "mean 227342.428169 193803.792550 1.507387e+12 3.947300 \n",
+ "std 14613.907188 88279.388177 3.706127e+08 0.323916 \n",
+ "min 200000.000000 137.000000 1.506959e+12 1.000000 \n",
+ "25% 214926.000000 128551.000000 1.507026e+12 4.000000 \n",
+ "50% 229109.000000 199197.000000 1.507308e+12 4.000000 \n",
+ "75% 240182.000000 272143.000000 1.507666e+12 4.000000 \n",
+ "max 249999.000000 364043.000000 1.508832e+12 4.000000 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
+ "mean 1.738285 13.628467 1.348209 18.250250 \n",
+ "std 1.020858 6.625564 1.703524 7.060798 \n",
+ "min 1.000000 2.000000 1.000000 1.000000 \n",
+ "25% 1.000000 12.000000 1.000000 13.000000 \n",
+ "50% 1.000000 17.000000 1.000000 21.000000 \n",
+ "75% 3.000000 17.000000 1.000000 25.000000 \n",
+ "max 5.000000 20.000000 11.000000 28.000000 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id \\\n",
+ "count 518010.000000 518010.000000 518010.000000 518010.000000 \n",
+ "mean 1.819614 15.521785 30.043586 305.324961 \n",
+ "std 1.082657 33.957702 56.868021 110.411513 \n",
+ "min 1.000000 1.000000 1.000000 1.000000 \n",
+ "25% 1.000000 4.000000 10.000000 252.000000 \n",
+ "50% 2.000000 8.000000 19.000000 323.000000 \n",
+ "75% 2.000000 18.000000 35.000000 399.000000 \n",
+ "max 7.000000 938.000000 938.000000 460.000000 \n",
+ "\n",
+ " created_at_ts words_count \n",
+ "count 5.180100e+05 518010.000000 \n",
+ "mean 1.506883e+12 210.966331 \n",
+ "std 5.816668e+09 83.040065 \n",
+ "min 1.265812e+12 0.000000 \n",
+ "25% 1.506970e+12 176.000000 \n",
+ "50% 1.507249e+12 199.000000 \n",
+ "75% 1.507630e+12 232.000000 \n",
+ "max 1.509949e+12 3082.000000 "
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
],
- "text/plain": [
- " user_id click_article_id count\n",
- "311242 86295 74254 10\n",
- "311243 86295 76268 10\n",
- "393761 103237 205948 10\n",
- "393763 103237 235689 10\n",
- "576902 134850 69463 13"
- ]
- },
- "execution_count": 28,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_count[user_click_count['count']>7]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:32:53.298575Z",
- "start_time": "2020-11-13T15:32:53.285611Z"
- }
- },
- "outputs": [
+ "source": [
+ "tst_click.describe()"
+ ]
+ },
{
- "data": {
- "text/plain": [
- "array([ 1, 2, 4, 3, 6, 5, 10, 7, 13])"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
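+ "We can see that the users in the training set and the test set are completely disjoint.\n",
+ "\n",
+ "User IDs in the training set range from 0 to 199999, while those in test set A range from 200000 to 249999.\n",
+ "\n",
+ "Therefore, when training we also need to include the test-set data; the combined data is referred to as the full dataset.\n",
+ "\n",
+ "**Note: the training and test sets are merged for the analysis that follows.**"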
]
- },
- "execution_count": 29,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_count['count'].unique()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "1 1605541\n",
- "2 11621\n",
- "3 422\n",
- "4 77\n",
- "5 26\n",
- "6 12\n",
- "10 4\n",
- "7 3\n",
- "13 1\n",
- "Name: count, dtype: int64"
- ]
- },
- "execution_count": 30,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# Number of times users clicked the same article\n",
- "user_click_count.loc[:,'count'].value_counts() "
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
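- "###### As shown, 1,605,541 user-article pairs (about 99.2%) were clicked only once; only a very small number of users clicked the same article repeatedly. This could also be turned into a separate feature."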
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Analysis of changes in user click environment"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:39:41.961797Z",
- "start_time": "2020-11-13T15:39:41.949829Z"
- }
- },
- "outputs": [],
- "source": [
- "def plot_envs(df, cols, r, c):\n",
- "    plt.figure()\n",
- "    plt.figure(figsize=(10, 5))\n",
- "    i = 1\n",
- "    for col in cols:\n",
- "        plt.subplot(r, c, i)\n",
- "        i += 1\n",
- "        v = df[col].value_counts().reset_index()\n",
- "        fig = sns.barplot(x=v['index'], y=v[col])\n",
- "        for item in fig.get_xticklabels():\n",
- "            item.set_rotation(90)\n",
- "        plt.title(col)\n",
- "    plt.tight_layout()\n",
- "    plt.show()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:39:55.476626Z",
- "start_time": "2020-11-13T15:39:48.764592Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "50000"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# The test set contains 50,000 users\n",
+ "tst_click.user_id.nunique()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:56:07.717463Z",
+ "start_time": "2020-11-13T15:56:07.693494Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1"
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tst_click.groupby('user_id')['click_article_id'].count().min()  # Note: some users in the test set have clicked only one article"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News article metadata table"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:20:34.183761Z",
+ "start_time": "2020-11-13T15:20:34.164770Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " click_article_id \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 0 \n",
+ " 0 \n",
+ " 1513144419000 \n",
+ " 168 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 1 \n",
+ " 1 \n",
+ " 1405341936000 \n",
+ " 189 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 2 \n",
+ " 1 \n",
+ " 1408667706000 \n",
+ " 250 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 3 \n",
+ " 1 \n",
+ " 1408468313000 \n",
+ " 230 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 4 \n",
+ " 1 \n",
+ " 1407071171000 \n",
+ " 162 \n",
+ " \n",
+ " \n",
+ " 364042 \n",
+ " 364042 \n",
+ " 460 \n",
+ " 1434034118000 \n",
+ " 144 \n",
+ " \n",
+ " \n",
+ " 364043 \n",
+ " 364043 \n",
+ " 460 \n",
+ " 1434148472000 \n",
+ " 463 \n",
+ " \n",
+ " \n",
+ " 364044 \n",
+ " 364044 \n",
+ " 460 \n",
+ " 1457974279000 \n",
+ " 177 \n",
+ " \n",
+ " \n",
+ " 364045 \n",
+ " 364045 \n",
+ " 460 \n",
+ " 1515964737000 \n",
+ " 126 \n",
+ " \n",
+ " \n",
+ " 364046 \n",
+ " 364046 \n",
+ " 460 \n",
+ " 1505811330000 \n",
+ " 479 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " click_article_id category_id created_at_ts words_count\n",
+ "0 0 0 1513144419000 168\n",
+ "1 1 1 1405341936000 189\n",
+ "2 2 1 1408667706000 250\n",
+ "3 3 1 1408468313000 230\n",
+ "4 4 1 1407071171000 162\n",
+ "364042 364042 460 1434034118000 144\n",
+ "364043 364043 460 1434148472000 463\n",
+ "364044 364044 460 1457974279000 177\n",
+ "364045 364045 460 1515964737000 126\n",
+ "364046 364046 460 1505811330000 479"
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Preview the first and last rows of the news article dataset\n",
+ "item_df.head().append(item_df.tail())"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:28:13.084501Z",
+ "start_time": "2020-11-13T15:28:13.062561Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "176 3485\n",
+ "182 3480\n",
+ "179 3463\n",
+ "178 3458\n",
+ "174 3456\n",
+ "183 3432\n",
+ "184 3427\n",
+ "173 3414\n",
+ "180 3403\n",
+ "177 3391\n",
+ "170 3387\n",
+ "187 3355\n",
+ "169 3352\n",
+ "185 3348\n",
+ "175 3346\n",
+ "181 3330\n",
+ "186 3328\n",
+ "189 3327\n",
+ "171 3327\n",
+ "172 3322\n",
+ "165 3308\n",
+ "188 3288\n",
+ "167 3269\n",
+ "190 3261\n",
+ "192 3257\n",
+ "168 3248\n",
+ "193 3225\n",
+ "166 3199\n",
+ "191 3182\n",
+ "194 3164\n",
+ " ... \n",
+ "601 1\n",
+ "857 1\n",
+ "1977 1\n",
+ "1626 1\n",
+ "697 1\n",
+ "1720 1\n",
+ "696 1\n",
+ "706 1\n",
+ "592 1\n",
+ "1605 1\n",
+ "586 1\n",
+ "582 1\n",
+ "1606 1\n",
+ "972 1\n",
+ "716 1\n",
+ "584 1\n",
+ "1608 1\n",
+ "715 1\n",
+ "841 1\n",
+ "968 1\n",
+ "964 1\n",
+ "587 1\n",
+ "1099 1\n",
+ "1355 1\n",
+ "711 1\n",
+ "845 1\n",
+ "710 1\n",
+ "965 1\n",
+ "847 1\n",
+ "1535 1\n",
+ "Name: words_count, Length: 866, dtype: int64"
+ ]
+ },
+ "execution_count": 21,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_df['words_count'].value_counts()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:28:59.029535Z",
+ "start_time": "2020-11-13T15:28:58.816106Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "461\n"
+ ]
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 22,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "print(item_df['category_id'].nunique())  # 461 article categories\n",
+ "item_df['category_id'].hist()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(364047, 4)"
+ ]
+ },
+ "execution_count": 23,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_df.shape  # 364,047 articles"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News article embedding vectors"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " article_id \n",
+ " emb_0 \n",
+ " emb_1 \n",
+ " emb_2 \n",
+ " emb_3 \n",
+ " emb_4 \n",
+ " emb_5 \n",
+ " emb_6 \n",
+ " emb_7 \n",
+ " emb_8 \n",
+ " ... \n",
+ " emb_240 \n",
+ " emb_241 \n",
+ " emb_242 \n",
+ " emb_243 \n",
+ " emb_244 \n",
+ " emb_245 \n",
+ " emb_246 \n",
+ " emb_247 \n",
+ " emb_248 \n",
+ " emb_249 \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 0 \n",
+ " -0.161183 \n",
+ " -0.957233 \n",
+ " -0.137944 \n",
+ " 0.050855 \n",
+ " 0.830055 \n",
+ " 0.901365 \n",
+ " -0.335148 \n",
+ " -0.559561 \n",
+ " -0.500603 \n",
+ " ... \n",
+ " 0.321248 \n",
+ " 0.313999 \n",
+ " 0.636412 \n",
+ " 0.169179 \n",
+ " 0.540524 \n",
+ " -0.813182 \n",
+ " 0.286870 \n",
+ " -0.231686 \n",
+ " 0.597416 \n",
+ " 0.409623 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 1 \n",
+ " -0.523216 \n",
+ " -0.974058 \n",
+ " 0.738608 \n",
+ " 0.155234 \n",
+ " 0.626294 \n",
+ " 0.485297 \n",
+ " -0.715657 \n",
+ " -0.897996 \n",
+ " -0.359747 \n",
+ " ... \n",
+ " -0.487843 \n",
+ " 0.823124 \n",
+ " 0.412688 \n",
+ " -0.338654 \n",
+ " 0.320786 \n",
+ " 0.588643 \n",
+ " -0.594137 \n",
+ " 0.182828 \n",
+ " 0.397090 \n",
+ " -0.834364 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 2 \n",
+ " -0.619619 \n",
+ " -0.972960 \n",
+ " -0.207360 \n",
+ " -0.128861 \n",
+ " 0.044748 \n",
+ " -0.387535 \n",
+ " -0.730477 \n",
+ " -0.066126 \n",
+ " -0.754899 \n",
+ " ... \n",
+ " 0.454756 \n",
+ " 0.473184 \n",
+ " 0.377866 \n",
+ " -0.863887 \n",
+ " -0.383365 \n",
+ " 0.137721 \n",
+ " -0.810877 \n",
+ " -0.447580 \n",
+ " 0.805932 \n",
+ " -0.285284 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 3 \n",
+ " -0.740843 \n",
+ " -0.975749 \n",
+ " 0.391698 \n",
+ " 0.641738 \n",
+ " -0.268645 \n",
+ " 0.191745 \n",
+ " -0.825593 \n",
+ " -0.710591 \n",
+ " -0.040099 \n",
+ " ... \n",
+ " 0.271535 \n",
+ " 0.036040 \n",
+ " 0.480029 \n",
+ " -0.763173 \n",
+ " 0.022627 \n",
+ " 0.565165 \n",
+ " -0.910286 \n",
+ " -0.537838 \n",
+ " 0.243541 \n",
+ " -0.885329 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 4 \n",
+ " -0.279052 \n",
+ " -0.972315 \n",
+ " 0.685374 \n",
+ " 0.113056 \n",
+ " 0.238315 \n",
+ " 0.271913 \n",
+ " -0.568816 \n",
+ " 0.341194 \n",
+ " -0.600554 \n",
+ " ... \n",
+ " 0.238286 \n",
+ " 0.809268 \n",
+ " 0.427521 \n",
+ " -0.615932 \n",
+ " -0.503697 \n",
+ " 0.614450 \n",
+ " -0.917760 \n",
+ " -0.424061 \n",
+ " 0.185484 \n",
+ " -0.580292 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
5 rows × 251 columns
\n",
+ "
"
+ ],
+ "text/plain": [
+ " article_id emb_0 emb_1 emb_2 emb_3 emb_4 emb_5 \\\n",
+ "0 0 -0.161183 -0.957233 -0.137944 0.050855 0.830055 0.901365 \n",
+ "1 1 -0.523216 -0.974058 0.738608 0.155234 0.626294 0.485297 \n",
+ "2 2 -0.619619 -0.972960 -0.207360 -0.128861 0.044748 -0.387535 \n",
+ "3 3 -0.740843 -0.975749 0.391698 0.641738 -0.268645 0.191745 \n",
+ "4 4 -0.279052 -0.972315 0.685374 0.113056 0.238315 0.271913 \n",
+ "\n",
+ " emb_6 emb_7 emb_8 ... emb_240 emb_241 emb_242 \\\n",
+ "0 -0.335148 -0.559561 -0.500603 ... 0.321248 0.313999 0.636412 \n",
+ "1 -0.715657 -0.897996 -0.359747 ... -0.487843 0.823124 0.412688 \n",
+ "2 -0.730477 -0.066126 -0.754899 ... 0.454756 0.473184 0.377866 \n",
+ "3 -0.825593 -0.710591 -0.040099 ... 0.271535 0.036040 0.480029 \n",
+ "4 -0.568816 0.341194 -0.600554 ... 0.238286 0.809268 0.427521 \n",
+ "\n",
+ " emb_243 emb_244 emb_245 emb_246 emb_247 emb_248 emb_249 \n",
+ "0 0.169179 0.540524 -0.813182 0.286870 -0.231686 0.597416 0.409623 \n",
+ "1 -0.338654 0.320786 0.588643 -0.594137 0.182828 0.397090 -0.834364 \n",
+ "2 -0.863887 -0.383365 0.137721 -0.810877 -0.447580 0.805932 -0.285284 \n",
+ "3 -0.763173 0.022627 0.565165 -0.910286 -0.537838 0.243541 -0.885329 \n",
+ "4 -0.615932 -0.503697 0.614450 -0.917760 -0.424061 0.185484 -0.580292 \n",
+ "\n",
+ "[5 rows x 251 columns]"
+ ]
+ },
+ "execution_count": 24,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_emb_df.head()"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "(295141, 251)"
+ ]
+ },
+ "execution_count": 25,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "item_emb_df.shape"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Data analysis"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Repeated clicks by users"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:30:20.899771Z",
+ "start_time": "2020-11-13T15:30:20.750817Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Merge the training and test click logs\n",
+ "user_click_merge = trn_click.append(tst_click)"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:30:26.290038Z",
+ "start_time": "2020-11-13T15:30:25.339579Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 0 \n",
+ " 0 \n",
+ " 30760 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 1 \n",
+ " 0 \n",
+ " 157507 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 1 \n",
+ " 63746 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 3 \n",
+ " 1 \n",
+ " 289197 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 4 \n",
+ " 2 \n",
+ " 36162 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 5 \n",
+ " 2 \n",
+ " 168401 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 6 \n",
+ " 3 \n",
+ " 36162 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 7 \n",
+ " 3 \n",
+ " 50644 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 8 \n",
+ " 4 \n",
+ " 39894 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ " 9 \n",
+ " 4 \n",
+ " 42567 \n",
+ " 1 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id count\n",
+ "0 0 30760 1\n",
+ "1 0 157507 1\n",
+ "2 1 63746 1\n",
+ "3 1 289197 1\n",
+ "4 2 36162 1\n",
+ "5 2 168401 1\n",
+ "6 3 36162 1\n",
+ "7 3 50644 1\n",
+ "8 4 39894 1\n",
+ "9 4 42567 1"
+ ]
+ },
+ "execution_count": 27,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Count how many times each user clicked each article\n",
+ "user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['click_timestamp'].agg({'count'}).reset_index()\n",
+ "user_click_count[:10]"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:34:27.418638Z",
+ "start_time": "2020-11-13T15:34:27.372761Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 311242 \n",
+ " 86295 \n",
+ " 74254 \n",
+ " 10 \n",
+ " \n",
+ " \n",
+ " 311243 \n",
+ " 86295 \n",
+ " 76268 \n",
+ " 10 \n",
+ " \n",
+ " \n",
+ " 393761 \n",
+ " 103237 \n",
+ " 205948 \n",
+ " 10 \n",
+ " \n",
+ " \n",
+ " 393763 \n",
+ " 103237 \n",
+ " 235689 \n",
+ " 10 \n",
+ " \n",
+ " \n",
+ " 576902 \n",
+ " 134850 \n",
+ " 69463 \n",
+ " 13 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id count\n",
+ "311242 86295 74254 10\n",
+ "311243 86295 76268 10\n",
+ "393761 103237 205948 10\n",
+ "393763 103237 235689 10\n",
+ "576902 134850 69463 13"
+ ]
+ },
+ "execution_count": 28,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_count[user_click_count['count']>7]"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:32:53.298575Z",
+ "start_time": "2020-11-13T15:32:53.285611Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "array([ 1, 2, 4, 3, 6, 5, 10, 7, 13])"
+ ]
+ },
+ "execution_count": 29,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_count['count'].unique()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "1 1605541\n",
+ "2 11621\n",
+ "3 422\n",
+ "4 77\n",
+ "5 26\n",
+ "6 12\n",
+ "10 4\n",
+ "7 3\n",
+ "13 1\n",
+ "Name: count, dtype: int64"
+ ]
+ },
+ "execution_count": 30,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Distribution of click counts per (user, article) pair\n",
+ "user_click_count.loc[:,'count'].value_counts() "
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
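+ "###### As shown above, 1,605,541 user-article pairs (about 99.2%) were clicked only once; only a very small number of articles were clicked repeatedly by the same user. The repeat-click count can also be built as a separate feature."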
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Analysis of changes in the user click environment"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
},
{
- "data": {
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:39:41.961797Z",
+ "start_time": "2020-11-13T15:39:41.949829Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def plot_envs(df, cols, r, c):\n",
+ "    plt.figure()\n",
+ "    plt.figure(figsize=(10, 5))\n",
+ "    i = 1\n",
+ "    for col in cols:\n",
+ "        plt.subplot(r, c, i)\n",
+ "        i += 1\n",
+ "        v = df[col].value_counts().reset_index()\n",
+ "        fig = sns.barplot(x=v['index'], y=v[col])\n",
+ "        for item in fig.get_xticklabels():\n",
+ "            item.set_rotation(90)\n",
+ "        plt.title(col)\n",
+ "    plt.tight_layout()\n",
+ "    plt.show()"
]
- },
- "metadata": {},
- "output_type": "display_data"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:39:55.476626Z",
+ "start_time": "2020-11-13T15:39:48.764592Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAr8AAAFgCAYAAACymRGJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAA9kUlEQVR4nO3dedxtc93/8df7IHOiI5mOUwpRMpxSuO8kFUWGkqSBulN3Km5Rqe7I1Eh3GoQyZSZ+KSUiiQYOyVSSeR4zD3G8f3+s76V9Ltew93Xtfa09vJ+Px3pcew17rc+1z/U5389e67u+S7aJiIiIiBgE0+oOICIiIiJiqqT4jYiIiIiBkeI3IiIiIgZGit+IiIiIGBgpfiMiIiJiYKT4jYiIiIiBkeI3IgaapO0lXdAw/4ikl47znpmSLGneSR77RkkbTWYfZT/jxhwREZUUv21WZ0PaLmlIY5DZXsT29XXH0Yp2xSzpzZJ+I+lhSfdJukzSZyUt0I44I3pdP7TxkeK34wa5Ia2TpA0k3Vp3HBG9QtLWwCnAccAKtl8IbAMsByw/ynvSmMdA64f2chCl+I2WSJqn7hgiJkrS8pJOlXRPObP53RG2saSXldcLSjpA0k2SHpR0gaQFR3jPO0sXhleOc/z3l33dJ+kLw9ZNk/Q5SdeV9SdJWqKs+6WkTwzb/i+StmolZkmvk/R7SQ+U929Qlgs4ENjb9mG27wewfY3tT9q+tmy3l6RTJB0j6SFge0nLSDpd0v2S/iHpIw0xHilp34b5ub6Uls9sD0lXS/qnpCNyljkiOi3F7yR0QUM6YkNW1p0naR9JF5ZLmGdJml7WtdKQHinpYEm/kPQo8EZJryj7f0DSVZLe0bCfIyV9T9IZ5bh/krTisM/j45KuLev3kbRi+T0eKg3+8xq231TVpdcHyjarN6y7UdJuki4vn+eJkhaQtDDwS2AZVZekHpG0zFifZfQ/VV/cfg7cBMwElgVOGOdt3wTWBtYFlgA+AzwzbL87AF8DNrJ95RjHXxU4GHg/sAzwQqqzqkM+CWwBvKGs/yfwvbLueGDbYftaATij2ZglLVu237cs3w34iaQlgZVLLD8Z47MYsjnVGeIXAMdSfYa3lpjfBewvacMm9jNkO+CtwIrASsAXW3hvRMd0QRv/jtLGPlDa3Fc0rPuspNtKO3qNpDe143ceGLYzTWAC5gH+AnwLWBhYAFgf2B64oGE7Ay8rr78HnEfV6M5D1TjNT9UQG5gX2AH4x9B7xjj+ssB9wNuovsS8ucwvWdafB1xH1ZgsWOa/WtZ9ALiwYV+rAg8A848Q85HAg8B65TiLlvg+DzwP2BB4GFi5Yfv7gNeW3+dY4IRhn8dPgecDqwFPAucALwUWA64GPli2XRO4G1infF4fBG5siPNG4CKqRncJ4K/Ax8q6DYBb6/47ydQ9E/B64B5g3mHLR8zZ8vf+OPDqEfY1lLO7lb/Z5Zo4/peG5cLCwL+oimbK3++bGtYvDTxV8mhR4FGq7ggA+wGHtxjzZ4EfD1v2q5JX65d9LNCw7oTy/8JjwPvLsr2A8xu2WR6YAyzasOwrwJHl9ZHAvg3r5srLksMfa5h/G3Bd3X8rmTJRfxu/Usn5NwPzUX2J/QdVu7sycAuwTNl2JrBi3Z9ZL0058ztxr6Uquna3/ajtJ2xfMNrGkqYBHwJ2tn2b7Tm2f2/7yYbNdgF2Bzaw/Y9xjv8+4Be2f2H7GdtnA7OpGo8hR9j+u+3HgZOANcry04A1JK1Q5rcDTh0WS6Of2r7Q9jNlH4tQFdL/sn0u1dm0bRu2P832Rbafpip+1xi2v6/bfsj2VcCVwFm2r7f9INUZ2zXLdjsCh9j+U/m8jqIqll/XsK+DbN/u6jLtz0Y4VsSQ5YGbyt9lM6ZTNXjXjbHN7sD3bDfTv3wZqgYLANuPUn1RHLICcFo5y/MAVTE8B1jK9sNUZ23fU7bdliq3Wol5BWDrof2XY6xP
VWQPxbF0Q3zvsf0C4FKqhnzILQ2vlwHuL/ENuYmq8W9W4/5uKvuMqFvdbfw2wBm2z7b9FNUVnQWpCuo5VEX1qpLms32j7bH+n4phUvxOXN0N6VgN2ZA7G14/RlW00kJDOmR4Y3dLKYSHDG/sRjxug7saXj8+wvzQ9isAnx72Oy7P3I3jeMeKGHILMEPN36R1L/AE1eX40bwF+KKkdzaxvztouHFM0kJUXR8a49vE9gsapgVs31bWHw9sK+n1VP+X/KbFmG+hOvPbuP+FbX8VuAa4Ddiqid/DDa9vB5aQtGjDshllX1CduVqoYd2LR9hf4810M8o+I+pWdxu/DFXbCkBpc28Bli2F8y5UV2LulnRCuva1JsXvxNXdkI7VkDWjmYZ0yPDGbvnyLXdIY2PXTrcA+w37HReyfXwT7/X4m8SAuYiqAP2qpIVL//D1Rtu4NDaHAwequqlrHkmvlzR/w2ZXARsD32vs+z6KU4BNJa1f+rXvzdz/B/8A2G/oioykJSVt3rD+F1RfCPcGThz2BbSZmI8BNpP01rJ8AVU3oC1X3vdpYE9JH5G0uCovB5Ya4zO6Bfg98JWyv9WBD5djAVwGvE3SEpJeTNVgD7eTpOVU3dz3BeDEsT7EiClSdxt/O1W+A8/elLo8pa21fZzt9cs2prrvIJqU4nfi6m5IR23Imox/3IZ0FH+iOsP6GUnzqbrJbjPGv3FoIg4DPiZpndIQLyzp7cPOMo3mLuCFkhbrQFzRg2zPofpbfRlwM9VNWtuM87bdgCuAi4H7qRqYuf7ftP0XYFPgMEmbjHH8q4CdqIYSu4PqhrbGM0DfBk4HzpL0MPBHqv7uQ+9/EjgV2Kjso6WYS6G6OVV//XuoGvfdh34f2ycC76bqUnULVWN+EnAocPIYx9uWqs/h7VRdqva0/euy7sdU/SZvBM5i5ML2uLLueqqzZvuOsE3EVKu7jT8JeLukN0maj+rL6ZPA7yWtLGnDsu8nqK6YNtuGB+SGt8lMVGc8/x9Vf7l7gYMYuzP8gsD/UX1zexA4vyybWbabt2w3i6p422Sc468D/JaqgbuHqivDjLLuPOC/GradK66y7EfluK8Ztnz4DW/7Dlu/Wjnug1Q3+2zZsG6u7XnuDS7P7rvMXwBs3zC/L/DDhvmNqRrxB6j+IzqZcnMNVYO6UcO2ewHHNMwfXv5tHqDcGJApU6bumYbncKZM3TR1QRu/ZWljHyxt7mpl+epUxfnDpf3/edq41iaVDzIiImJKSbqR6kv6r8fbNiKiXdLtISKiTSRtp3+PLd04XVV3bBERUcmZ3y4maTvgkBFW3WR7tamOJyIiItojbXx9UvxGRERExMBodgiP2k2fPt0zZ86sO4yItrnkkkvutb1k3XF0QvI1+lFyNqJ3jJWvPVP8zpw5k9mzZ9cdRkTbSLpp/K16U/I1+lFyNqJ3jJWvueEtIiIiIgZGit+IiIiIGBgpfiMiIiJiYKT4jYiIiIiB0TM3vLXL2rsfXXcI0eMu+cYH6g5hYCRfY7KSr1MrORuTNRU5mzO/ERERETEwUvxGRERExMBI8RsRERERAyPFb0REREQMjBS/ERERETEwUvxGRERExMBI8RsRERERAyPFb0REREQMjBS/ERERETEwmi5+Jf24mWUR0T0kPV/SonXHERER0S1aOfO7WuOMpHmAtdsbTkS0g6TXSLoCuBy4UtJfJCVfIyJi4I1b/EraQ9LDwOqSHirTw8DdwE87HmFETMSPgI/bnml7BWAn4IiaY4qIiKjduMWv7a/YXhT4hu3nl2lR2y+0vccUxBgRrZtj+3dDM7YvAJ6uMZ6IiIiuMG+zG9reQ9KywAqN77N9ficCi4hJ+a2kQ4DjAQPbAOdJWgvA9qV1BhcREVGXpotfSV8F3gNcDcwpiw2MWvxKOhzYFLjb9ivLsr2AjwD3lM0+b/sXLUceEWN5dfm557Dla1Ll7YYjvSk5G9E7kq8RE9N08QtsCaxs+8kW3nMk
8F3g6GHLv2X7my3sJyJaYPuNE3zrkSRnI3rFkSRfI1rWSvF7PTAf0HTxa/t8STNbDSoiJkfSl0Zabnvvsd6XnI3oHcnXiIlpZaizx4DLJB0i6aChaYLH/YSkyyUdLmnxCe4jIkb3aMM0B9gEmDmJ/SVnI3pH8jViDK0Uv6cD+wC/By5pmFp1MLAisAZwB3DAaBtK2lHSbEmz77nnntE2i4hhbB/QMO0HbAC8dIK7aypnk68RXSFtbMQ4Whnt4ShJCwIzbF8z0QPavmvotaTDgJ+Pse2hwKEAs2bN8kSPGREsBCw3kTc2m7PJ14j6pY2NGF8rjzfeDLgMOLPMryHp9FYPKGnphtktgStb3UdEjE3SFeWy5+WSrgKuAf5vgvtKzkb0iORrxPhaueFtL+C1wHkAti+TNOZlVEnHU11unS7pVqphlzaQtAbVcEs3Ah9tMeaIGN+mDa+fBu6yPe5DLpKzEb0j+RoxMa0Uv0/ZflBS47JnxnqD7W1HWPyjFo4ZERNg+yZJrwb+oyw6H7i8ifclZyN6RPI1YmJaueHtKknvBeaR9HJJ36G6+S0iuoyknYFjgReV6VhJn6w3qoiIiPq1Uvx+EliNapzf44GHgF06EFNETN6HgXVsf8n2l4DXUT31KSIiYqC1MtrDY8AXyhQR3U38+zHklNcaZduIiIiB0XTxK2kW8HmqgfKffZ/t1dsfVkRM0hHAnySdVua3IH0BIyIiWrrh7Vhgd+AKxrnRLSLqI2ka8EeqkVnWL4t3sP3n2oKKiIjoEq0Uv/fYbnlc34iYWrafkfQ922sCl9YdT0RERDdppfjdU9IPgXOobnoDwPapbY8qIibrHEnvBE61nSc3RUREFK0UvzsAqwDz8e9uDwZS/EZ0n48CuwJPS3qC6mY3235+vWFFRETUq5Xi9zW2V+5YJBHRNrYXrTuGiIiIbtRK8ft7Savavrpj0UTEpEiaB1jQ9iNl/nXA88rqP9t+uLbgIiIiukArxe/rgMsk3UDV53foMmqGOovoHl8D7ga+XuaPB64EFqC6+e2zNcUVERHRFVopfjfuWBQR0S5vAl7TMP+A7c0kCfhdTTFFRER0jaYfb2z7JuBW4CmqG92GpojoHtNsP90w/1moLtEAi9QTUkQ0S9LiknJFNaKDWnnC2yeBPYG7mHu0hyRpRPd4nqRFh/r22j4LQNJiVF0fIqLLSDoPeAdVm3wJcLekC23vWmtgEX2q6TO/wM7AyrZXs/2qMqXwjeguhwEnSpoxtEDSClR9f39YW1QRMZbFbD8EbAUcbXsdYKOaY4roW630+b0FeLBTgUTE5Nk+UNJjwAWSFqa6MfVh4Ku2D643uogYxbySlgbeDXyh7mAi+l0rxe/1wHmSzmDuJ7wd2PaoImLCbP8A+IGkRct8hjeL6G57A78CLrR9saSXAtfWHFNE32ql+L25TM/j3+OGRkQXkrQUsD+wDLCJpFWB19v+Ub2RRcRwtk8GTm6Yvx54Z30RRfS3potf218GkLRImX+kU0FFxKQdCRzBvy+h/h04EUjxG9FlJC0HfAdYryz6HbCz7VvriyqifzV9w5ukV0r6M3AVcJWkSySt1rnQImISpts+iTIySxn+bE69IUXEKI4ATqe6UrMM8LOyLCI6oJXRHg4FdrW9gu0VgE9T3VkeEd3nUUkvpIzFXR5znBtWI7rTkraPsP10mY4Elqw7qIh+1Uqf34Vt/2ZoxvZ55W7yiOg+u1KdSVpR0oVUDem76g0pIkZxn6T3UQ1JCLAtcF+N8UT0tVbO/F4v6X8lzSzTF6lGgBiVpMMl3S3pyoZlS0g6W9K15efiEw0+IkZm+1LgDcC6wEeB1WxfPt77krMRtfgQ1TBndwJ3UH1R3WG8NyVfIyamleL3Q1Rnj04FfgJML8vGciSw8bBlnwPOsf1y4JwyHxFtJGknYBHbV9m+ElhE0sebeOuRJGcjppTtm2y/w/aStl9kewvbNw+tl7THKG89kuRrRMua
Kn4lzQOcavtTtteyvbbtXWz/c6z32T4fuH/Y4s2Bo8rro4AtWow5Isb3EdsPDM2UXP3IeG9KzkZ0pa1HWph8jZiYpopf23OAZyQt1oZjLmX7jvL6TmCpNuwzIuY2jyQNzZQvsBMdnzs5G1Evjb/Js5KvEeNo5Ya3R4ArJJ0NPDq00PanJnpw25bk0dZL2hHYEWDGjBkTPUzEIDoTOFHSIWX+o2XZpIyVs8nXiI4ZtZ0c801pYyNG1Eqf31OB/wXOBy5pmFp1V3mGOeXn3aNtaPtQ27Nsz1pyyYz6EtGCzwK/Af67TOcAn5ngvprK2eRrRMe0cuY3bWzEOFp5wttR42/VlNOBDwJfLT9/2qb9RkRh+xng4DJNVnI2ooMkLWH7/mHLXmL7hjJ78ghvG03yNWIc4xa/kk6y/W5JVzDCpRfbq4/x3uOBDYDpkm4F9qRKyJMkfRi4iWp4l4hog8nka3l/cjZi6v1M0ia2HwKQtCpwEvBKANv7j/Sm5GvExDRz5nfn8nPTVndue9tRVr2p1X1FRFMmnK+QnI2oyf5UBfDbgZWBo4HtxntT8jViYsYtfhvuGn0ncILt2zsbUkRMVPI1ovfYPkPSfMBZwKLAlrb/XnNYEX2rldEeFgXOlnQ/cCJwsu27OhNWRExS8jWiy0n6DnN3T1oMuA74hKRJjaYUEaNr5Ya3LwNflrQ6sA3wW0m32t6oY9FFxIQkXyN6wuxh8xMZQSkiWtTKmd8hd1MNnH0f8KL2hhMRbZZ8jehSQ6MoSVoYeKI8UGrooTTz1xlbRD9repxfSR+XdB7VeKEvpHp86ph3jkdEPZKvET3lHGDBhvkFgV/XFEtE32vlzO/ywC62L+tQLBHRPsnXiN6xgO1HhmZsPyJpoToDiuhnTZ/5tb0H1eONl5E0Y2jqYGwRMUElXxeRtAOApCUlvaTmsCJiZI9KWmtoRtLawOM1xhPR15o+8yvpE8BewF3AM2WxgVxKjegykvYEZlGNGXoEMB9wDLBenXFFxIh2AU6WdDvVo4xfTHWjakR0QCvdHnYBVrZ9X4diiYj22RJYE7gUwPbtkhatN6SIGIntiyWtQvVlFeAa20/VGVNEP2ul+L0FeLBTgUREW/3LtiUZnr2bPCK6iKQNbZ8raathq1Yq4/yeWktgEX2uleL3euA8SWcATw4ttH1g26OKiMk6SdIhwAskfQT4EHBYzTFFxNzeAJwLbDbCOgMpfiM6oJXi9+YyPa9MEdGlbH9T0puBh6gupX7J9tk1hxURDWzvWX7uUHcsEYOk1Se8IWkh2491LqSIaIdS7KbgjehSknYda32urEZ0RiujPbwe+BGwCDBD0quBj9r+eKeCi4jWSHqY6nLpiGw/fwrDiYixjXUT6qh5HBGT00q3h/8D3gqcDmD7L5L+sxNBRcTE2F4UQNI+wB3Aj6mGTtoOWLrG0CJimIYrqkcBO9t+oMwvDhxQY2gRfa3ph1wA2L5l2KI5bYwlItrnHba/b/th2w/ZPhjYvO6gImJEqw8VvgC2/0k1VGFEdEArxe8tktYFLGk+SbsBf+1QXBExOY9K2k7SPJKmSdoOeLTuoCJiRNPK2V4AJC1Ba1dmI6IFrSTXx4BvA8sCtwFnATt1IqiImLT3UuXrt6n6Dl5YlkVE9zkA+IOkk8v81sB+NcYT0ddaGe3hXqp+gyOStIftr7QlqoiYFNs3MkY3h+RrRPewfbSk2cCGZdFWtq+uM6aIftZSn99xbN3GfUVEZyVfI7qI7attf7dMKXwjOqidxa/auK+I6Kzka0REDKR2Fr8ZkzCidyRfIyJiILXzbtKWziRJuhF4mGq4tKdtz2pjLBExtpbP/CZnI3pH8jVidK084W0J2/cPW/YS2zeU2ZNHeNt43lhupIuINupQvkJyNqKXJF8jRtBKt4efSXr20aiSVgV+NjRve/92BhYRk5J8jYiIGEErxe/+VA3qIpLWpjpz9L5JHNvAWZIukbTjJPYTEc/V7nyF
5GxEL0m+RoyilXF+z5A0H9XDLRYFtrT990kce33bt0l6EXC2pL/ZPr9xg5KwOwLMmDFjEoeKGCwdyFcYJ2eTrxFdJW1sxCjGLX4lfYe57wxfDLgO+IQkbH9qIge2fVv5ebek04DXAucP2+ZQ4FCAWbNm5e70iHF0Kl9h/JxNvkZ0j7SxEaNr5szv7GHzl0z2oJIWBqbZfri8fguw92T3GxHtz1dIzkb0kuRrxNjGLX5tHwXPJtMTtueU+XmA+Sd43KWA0yQNxXCc7TMnuK+IKDqUr5CcjeglydeIMbQyzu85wEbAI2V+Qar+hOu2elDb1wOvbvV9EdG0tuUrJGcjeknyNWJsrYz2sIDtoYaU8nqh9ocUEW2QfI2IiBhBK8Xvo5LWGpopwyc93v6QIqINkq8REREjaKXbwy7AyZJup3o06ouBbToRVERM2i4kXyMiIp6jlXF+L5a0CrByWXSN7ac6E1ZETEbyNSIiYmTNjPO7oe1zJW01bNVKZdzQUzsUW0S0KPkaERExtmbO/L4BOBfYbIR1BtKYRnSP5GtERMQYmhnnd8/yc4fOhxMRk5F8jYiIGFsz3R52HWu97QPbF05ETEbyNSIiYmzNdHtYdIx1eRZ4RHdJvkZERIyhmW4PXwaQdBSws+0HyvziwAEdjS4iWpJ8jYiIGFsrD7lYfaghBbD9T2DNtkcUEe2QfI2IiBhBK8XvtHL2CABJS9DaQzIiYuokXyMiIkbQSmN4APAHSSeX+a2B/dofUkS0QfI1IiJiBK084e1oSbOBDcuirWxf3ZmwImIykq8REREja+kyaGk804BG9IDka0RExHO10uc3IiIiIqKnpfiNiIiIiIGR4jciIiIiBkaK34iIiIgYGCl+IyIiImJgpPiNiIiIiIGR4jciIiIiBkZtxa+kjSVdI+kfkj5XVxwR0ZzkbETvSL5GjK6W4lfSPMD3gE2AVYFtJa1aRywRMb7kbETvSL5GjK2uM7+vBf5h+3rb/wJOADavKZaIGF9yNqJ3JF8jxlBX8bsscEvD/K1lWUR0p+RsRO9IvkaMYd66AxiLpB2BHcvsI5KuqTOeATIduLfuILqVvvnBdu1qhXbtqBskX2uTfB1DG/MVkrMxecnXcUxFG1tX8XsbsHzD/HJl2VxsHwocOlVBRUXSbNuz6o4jusq4OZt8rUfyNUaQNrZLJV+7Q13dHi4GXi7pJZKeB7wHOL2mWCJifMnZiN6RfI0YQy1nfm0/LekTwK+AeYDDbV9VRywRMb7kbETvSL5GjE22644huoykHcvlsIjocsnXiN6RfO0OKX4jIiIiYmDk8cYRERERMTBS/EZERETEwEjxGxEREREDI8VvzEXS0XXHEBGjk/RaSa8pr1eVtKukt9UdV0Q8l6RVJL1J0iLDlm9cV0yRG94GmqTh4z4KeCNwLoDtd0x5UBExKkl7AptQDVN5NrAO8BvgzcCvbO9XY3gR0UDSp4CdgL8CawA72/5pWXep7bVqDG+gdfXjjaPjlgOuBn4ImKr4nQUcUGdQETGqd1E1ovMDdwLL2X5I0jeBPwEpfiO6x0eAtW0/ImkmcIqkmba/TdXeRk3S7WGwzQIuAb4APGj7POBx27+1/dtaI4uIkTxte47tx4DrbD8EYPtx4Jl6Q4uIYabZfgTA9o3ABsAmkg4kxW+tUvwOMNvP2P4WsAPwBUnfJVcDIrrZvyQtVF6vPbRQ0mKk+I3oNndJWmNophTCmwLTgVfVFVSkz280kPR2YD3bn687loh4Lknz235yhOXTgaVtX1FDWBExAknLUV2tuXOEdevZvrCGsIIUvxERERExQNLtISIiIiIGRorfiIiIiBgYKX77nKTft7j9BpJ+3ql4ImJ0ydeI3pKc7U0pfvuc7XXrjiEimpN8jegtydnelOK3z0l6pPzcQNJ5kk6R9DdJx0pSWbdxWXYpsFXDexeWdLikiyT9WdLmZfm3JX2pvH6rpPMl5W8pYpKSrxG9JTnbmzKm
62BZE1gNuB24EFhP0mzgMGBD4B/AiQ3bfwE41/aHJL0AuEjSr4E9gIsl/Q44CHib7YwxGtFeydeI3pKc7RH5JjFYLrJ9a0miy4CZwCrADbavdTXu3TEN278F+Jyky4DzgAWAGeXpUh8Bzga+a/u6KfsNIgZH8jWityRne0TO/A6WxsHx5zD+v7+Ad9q+ZoR1rwLuA5ZpU2wRMbfka0RvSc72iJz5jb8BMyWtWOa3bVj3K+CTDf2W1iw/VwA+TXWJZxNJ60xhvBGDLPka0VuSs10oxe+As/0EsCNwRumMf3fD6n2A+YDLJV0F7FOS9EfAbrZvBz4M/FDSAlMcesTASb5G9JbkbHfK440jIiIiYmDkzG9EREREDIwUvxERERExMFL8RkRERMTASPEbEREREQMjxW9EREREDIwUvxERERExMFL8RkRERMTASPEbEREREQMjxW9EREREDIwUvxERERExMFL8RkRERMTASPEbEREREQMjxW+XkbS9pAsa5h+R9NJx3jNTkiXN2/kII2I0vZq/kraTdFZdx49oRjfnl6T1JF1bYtqik8eKyUux1OVsL1J3DO0iaSZwAzCf7adrDiei43olf20fCxxbdxwRreiy/Nob+K7tb9cdyHCS9gJeZvt9dcfSLXLmN7pKzl5HtC55E9E5TebXCsBV7dq/pHkmu48YXYrfGklaXtKpku6RdJ+k746wjSW9rLxeUNIBkm6S9KCkCyQtOMJ73inpRkmvHOf460v6vaQHJN0iafuyfDFJR5e4bpL0RUnTyrq9JB3TsI+5LilJOk/SPpIulPSwpLMkTS+bn19+PlAuDb2+XMa6UNK3JN0H7C3pfkmvajjGiyQ9JmnJVj7fiE6qM38b8u7Dkm4Gzi3LPyTpr5L+KelXklZoeM9bJF1Tjv19Sb+V9F9l3fDLyetKurhse7GkdRvWjZXjEW3RS/kl6TrgpcDPSts2f2lHfyTpDkm3SdpXpaAdod3bS9KRkg6W9AtJjwJvlLSMpJ+Uz+AGSZ9qiHEvSadIOkbSQ8D2o/wuGwOfB7Ypsf1F0taSLhm23a6SflpeHynpB5LOLjn+22H/l6xS1t1f/k9592ifZbdK8VuTkgQ/B24CZgLLAieM87ZvAmsD6wJLAJ8Bnhm23x2ArwEb2b5yjOOvAPwS+A6wJLAGcFlZ/R1gMapkfgPwAWCHJn81gPeW7V8EPA/YrSz/z/LzBbYXsf2HMr8OcD2wFLAP1efQeHlmW+Ac2/e0EENEx9Sdvw3eALwCeKukzakaua2ocvp3wPFlv9OBU4A9gBcC15Q4RvrdlgDOAA4q2x4InCHphQ2bjZbjEZPWa/lle0XgZmCz0rY9CRwJPA28DFgTeAvwXw37bmz39ivL3lteLwr8HvgZ8Jfy+78J2EXSWxv2sTlVXr+AUbot2T4T2B84scT2auB04CWSXtGw6fuBoxvmt6Nqj6dT1QbHAkhaGDgbOI4q/98DfF/SqiN+gt3KdqYaJuD1wD3AvMOWbw9c0DBvquSZBjwOvHqEfc0s2+0GXA0s18Tx9wBOG2H5PMC/gFUbln0UOK+83gs4ZoRjz1vmzwO+2LD+48CZI23b8PvePCyGdaj+I1GZnw28u+5/s0yZhqYuyN+h97y0YdkvgQ83zE8DHqO6HPsB4A8N6wTcAvzX8LipGsGLhh3vD8D25fWoOZ4pUzumXsuvMn8jVVENVUH7JLBgw/bbAr9p+D2Gt3tHAkc3zK8zwjZ7AEeU13sB5zf5ee5FQ7tdlh0M7Fderwb8E5i/IZYTGrZdBJgDLA9sA/xu2L4OAfas+++mlSlnfuuzPHCTm7/xazqwAHDdGNvsDnzP9q1NHn+kfU0H5qP6xj3kJqpvns26s+H1Y1SJM5ZbGmds/6m8bwNJq1D953Z6C8eP6LS683dIY+6sAHxbVTemB4D7qYrcZYFlGrd11WKNdpxlmDv/4bn/B7Sa4xGt6LX8Gm4Fqnb0jobtD6E6UzrSvkc7
3jJD7y/7+DxVYT3WPpp1FPBeSaL6wnuSqzPWz9m37Ueoft9lSlzrDItrO+DFk4hlyqWDdH1uAWZImrfJBL8XeAJYkeoyyEjeApwp6U7bP2ni+K8d5ThPUf2BX12WzQBuK68fBRZq2L6VP3i3sPwoqq4PdwKn2H6iheNEdFrd+TukMXduoTqT85zLn5JeDizXMK/G+WFup8r/RjOAM5uMKWKyeiq/RnAL1Znf6WPEP1K7N/x4N9h+eZPxjeU529n+o6R/Af9B1d3ivcM2WX7ohaRFqLqS3F7i+q3tNzd57K6UM7/1uQi4A/iqpIUlLSBpvdE2tv0McDhwYOkEP4+qG8bmb9jsKmBj4HuS3jHO8Y8FNpL0bknzSnqhpDVszwFOAvaTtGjpG7wrMHST22XAf0qaIWkxqsswzbqHqg/WmOMyFscAW1IVwEePs23EVKs7f0fyA2APSavBszeubl3WnQG8StIWqm5O3YnRv7j+AlhJ0nvL/w3bAKtS9cGMmAq9ll/D47kDOAs4QNLzJU2TtKKkN7RwvIuAhyV9VtXNfPNIeqWk10wg9ruAmSo3rjc4Gvgu8JTtC4ate5uqm+KfR9X394+2b6H6f2AlSe+XNF+ZXjOs/3DXS/Fbk1JkbkZ1Sf9mqkuQ24zztt2AK4CLqS5BfI1h/4a2/wJsChwmaZMxjn8z8Dbg02VflwGvLqs/SXWG93rgAqqO7YeX950NnAhcDlxCCw2i7ceoOvNfWC6XvG6MbW8BLqX6xvq7Zo8RMRXqzt9RYjqt7PMEVXd/XwlsUtbdC2wNfB24j6qYnU11dmr4fu4rMXy6bPsZYNOyj4iO67X8GsUHqG4GvZqqP+0pwNItHG9OiXUNqvHx7wV+SHUzeqtOLj/vk3Rpw/IfA6/k3ye3Gh0H7En1Wa5NuQnd9sNUZ9HfQ3Um+E6qz2X+EfbRtYZuKIroOpIOB263/cW6Y4noJ+UM0K3AdrZ/U3c8ETH1VA0Fdzewlu1rG5YfCdzaz21v+vxGV1L1NLitqIaIiYhJKkMk/YnqrvjdqW7W+WOtQUVEnf4buLix8B0U6fbQxyRtp2pQ6+HThJ5CM1Uk7UN1Sekbtm+oO56IOnQgf19PdTf8vVSXlLew/XjbAo7oIb3aPo5G0i9H+X0+P8r2NwI7U3VvGjjp9hARERERAyNnfiMiIiJiYPRMn9/p06d75syZdYcR0TaXXHLJvbaXrDuOTki+Rj9Kzkb0jrHytWeK35kzZzJ79uy6w4hoG0nDn6LVN5Kv0Y+SsxG9Y6x8TbeHiIiIiBgYKX4jIiIiYmCk+I2IiIiIgdEzfX6ju9y896vqDqGrzfjSFXWHEPGs5Ov4krPtsfbuR9cdQvS4S77xgY4fI2d+IyIiImJgpPiNiIiIiIGR4jciIiIiBkaK34iIiIgYGCl+IyIiImJgpPiNiIiIiIGR4jciIiIiBkaK34iIiIgYGCl+IyIiImJgpPiNiIiokaSVJJ0j6coyv7qkL9YdV0S/SvEbERFRr8OAPYCnAGxfDryn1ogi+liK34iIiHotZPuiYcueriWSiAGQ4jciIqJe90paETCApHcBd9QbUkT/mrfuACIiIgbcTsChwCqSbgNuALarN6SI/pXiNyIioka2rwc2krQwMM32w3XHFNHPOtrtQdLhku4euoO1LFtC0tmSri0/F+9kDBHRvORsxNST9EJJBwG/A86T9G1JL2zifSPl616SbpN0WZne1snYI3pRp/v8HglsPGzZ54BzbL8cOKfMR0R3OJLkbMRUOwG4B3gn8K7y+sQm3nckz81XgG/ZXqNMv2hblBF9oqPFr+3zgfuHLd4cOKq8PgrYopMxRETzkrMRtVja9j62byjTvsBS471plHyNiHHUMdrDUraH7mK9kzESXNKOkmZLmn3PPfdMTXQRMVxTOZt8jZiwsyS9R9K0Mr0b+NUk9vcJSZeXbhGjdlNKzsagqnWoM9umDO0yyvpDbc+yPWvJJZecwsgi
YiRj5WzyNWLCPgIcBzxZphOAj0p6WNJDLe7rYGBFYA2q4dIOGG3D5GwMqjpGe7hL0tK275C0NHB3DTFERPOSsxEdZHvRNu7rrqHXkg4Dft6ufUf0izrO/J4OfLC8/iDw0xpiiIjmJWcjOkjSTyS9TdKk2+TyBXXIlsCVo20bMaiaTjRJr2p155KOB/4ArCzpVkkfBr4KvFnStcBGZT4iukByNqIWB1M91OJaSV+VtHIzbxolX78u6QpJlwNvBP6nY1FH9KhWuj18X9L8VEOrHGv7wfHeYHvbUVa9qYXjRsQUSc5GTD3bvwZ+LWkxYNvy+hbgMOAY20+N8r6R8vVHnYs0oj80febX9n9QfTNdHrhE0nGS3tyxyCIiIgZEeajF9sB/AX8Gvg2sBZxdY1gRfamlG95sXyvpi8Bs4CBgTUkCPm/71E4EGBER0c8knQasDPwY2KxhaMETJc2uL7KI/tR08StpdWAH4O1U30Q3s32ppGWo+hyl+I3oIpKWBVagIc/LoPgR0V0OG/4kNknz237S9qy6goroV62c+f0O8EOqs7yPDy20fXs5GxwRXULS14BtgKuBOWWxgRS/Ed1nX2D4Y4j/QNXtISLarKniV9I8wG22fzzS+tGWR0RttgBWtv1k3YFExMgkvRhYFlhQ0pqAyqrnAwvVFlhEn2uq+LU9R9Lykp5n+1+dDioiJu16YD6qp0VFRHd6K9VNbstRPYltqPh9CPh8TTFF9L1Wuj3cAFwo6XTg0aGFtg9se1QRMVmPAZdJOoeGAtj2p+oLKSIa2T4KOErSO23/ZLTtJH2wbBsRbdBK8XtdmaYBQ49idNsjioh2OL1MEdHlxip8i52BFL8RbdJK8Xu17ZMbF0jaus3xREQb2D5K0vOAlcqia0YbKD8iup7G3yQimtXKc8T3aHJZRNRM0gbAtcD3gO8Df5f0n3XGFBETlqusEW007plfSZsAbwOWlXRQw6rnA093KrCImJQDgLfYvgZA0krA8cDatUYVERORM78RbdTMmd/bqZ7o9gRwScN0OtWdqhHRfeYbKnwBbP+davSHiOgikqZJevc4m104JcFEDIhxz/za/gvwF0nHpc9gRM+YLemHwDFlfjuqL7ER0UVsPyPpM8BJY2zziSkMKaLvtXLD22sl7cW/H5cqwLZf2onAImJS/hvYCRga2ux3VH1/I6L7/FrSbsCJzD2U6P31hRTRv1opfn8E/A9Vl4c542wbETUqT3Y7sEwR0d22KT93alhmICeXIjqgleL3Qdu/7FgkETFpkk6y/W5JVzDCHeK2V68hrIgYg+2X1B1DxCBppfj9jaRvAKcy9xOjLm17VBExUTuXn5vWGkVENE3SQsCuwAzbO0p6ObCy7Z/XHFpEX2ql+F2n/JzVsMzAhu0LJyImw/Yd5edNdccSEU07gqpL4bpl/jbgZCDFb0QHNF382n5jJwOJiPaR9DDP7fbwINWID5+2ff3URxURo1jR9jaStgWw/ZikjO0b0SFNF7+SvjTSctt7ty+ciGiT/wNuBY6jGpnlPcCKwKXA4cAGdQUWEc/xL0kLUr6wSlqRhu6FEdFerXR7eLTh9QJUfQr/2t5wIqJN3mH71Q3zh0q6zPZnJX2+tqgiYiR7AmcCy0s6FlgP2L7WiCL6WCvdHg5onJf0TeBXbY8oItrhsfLUqFPK/LuontIII4wCERH1kDQNWBzYCngd1ZWanW3fW2tgEX2smccbj2YhYLl2BRIRbbUd8H7gbuCu8vp95dJqnhYV0SVsPwN8xvZ9ts+w/fNmC19Jh0u6W9KVDcuWkHS2pGvLz8U7FnxEj2q6+JV0haTLy3QVcA1Vv8KI6DK2r7e9me3ptpcsr/9h+3HbF9QdX0TM5deSdpO0fClel5C0RBPvOxLYeNiyzwHn2H45cE6Zj4gGrfT5bRw39GngLttPtzmeiGgDSSsBBwNL2X6lpNWp+gHvW3NoEfFcE3rCm+3zJc0ctnhz/n1D61HAecBn
Jx1hRB9p+sxvGTf0BcBmwJbAqh2KKSIm7zBgD+ApANuXU434EBFdpPT5/ZztlwybJvpo46WGxvsG7gSWGuPYO0qaLWn2PffcM8HDRfSeVro97AwcC7yoTMdK+mSnAouISVnI9kXDluVKTUSXKX1+d+/Qvs0YN7jaPtT2LNuzllxyyU6EENGVWun28GFgHduPAkj6GvAH4DudCCwiJuXeMlbo0Lih7wLuGPstEVGTX0vaDTiRhmFFbd8/gX3dJWlp23dIWprqpteIaNBK8StgTsP8nLIsIrrPTsChwCqSbgNuoBoBIiK6z4T6/I7idOCDwFfLz59OLrSI/tNK8XsE8CdJp5X5LYAftT2iiJgUSfMAH7e9kaSFgWm2H647rogYme2XTOR9ko6nurltuqRbqR6W8VXgJEkfBm4C3t2uOCP6RSsPuThQ0nnA+mXRDrb/3JGoImLCbM+RtH55/eh420dEvSQtBOwKzLC9o6SXAyvb/vlY77O97Sir3tTuGCP6SdPFr6TXAVfZvrTMP1/SOrb/1LHoWrT27kfXHUJXu+QbH6g7hJg6f5Z0OnAyc/chPLW+kOaWfB1fcnZgHAFcAqxb5m+jyt0xi9+ImJhWnvB2MPBIw/wjZVlEdJ8FgPuADamGJ9yMucfqjojusaLtr/PvoQkfI/fURHRMSze8lWFTgGp4FkmtvD8ipojtHcZaL2kP21+ZqngiYkz/Ko8eHxqdZUXgyXpDiuhfrZz5vV7SpyTNV6adges7FVhEdNTWdQcQEc/aEzgTWF7SsVSPJf5MvSFF9K9Wit+PUfVHug24FVgH2LETQUVEx+WSakTNJK1XXp4PbAVsDxwPzLJ9Xk1hRfS9VkZ7uJsxHo+ay6gRPWXUpz5FxJQ5CFgb+IPttYAzao4nYiC0s8/u1kDTxa+kG4GHqR6W8bTtWW2MJSLG1vKZ3+RsRNs9JelQYDlJBw1faftTNcQU0ffaWfxO5DLqG23f28YYIgKQtMTwR6NKeontG8rsyRPcdXI2on02BTYC3ko11FlETIF2Fr+5jBrRPX4maRPbDwFIWhU4CXglgO396wwuIqB8kTxB0l9t/6XueCIGRSs3vI2n1TO/Bs6SdImk3DgX0V77UxXAi0ham+pM7/smuc/kbERnPC7pHElXAkhaXdIX6w4qol+18oS3dl9GXd/2bZJeBJwt6W+2zx+2/x0pI0rMmDGjxd1HDC7bZ0iaDzgLWBTY0vbfJ7nbMXM2+RoxYYcBuwOHANi+XNJxwL61RhXRp1o58/szSc8fmimXUX82NN/qZVTbt5WfdwOnAa8dYZtDbc+yPWvJJZdsZfcRA0nSdyQdVG6e2RBYDLgB+MRIN9S0YrycTb5GTNhCti8atuzpWiKJGACt9Pkduoz6dmBl4Ghgu4kcVNLCwDTbD5fXbwH2nsi+ImIus4fNt+UmmuRsREfdW57qNvSEt3cBd9QbUkT/amWc33ZeRl0KOE3SUAzH2T5zgvuKiML2UfBssfqE7Tllfh5g/knsOjkb0Tk7AYcCq0i6jepqzYROLkXE+MYtfiV9h7lHclgMuI7qMuqExiG0fT3w6lbfFxFNO4dqCKVHyvyCVF9c153IzpKzEZ1Rvph+3PZGjVdY6o4rop81c+a3I5dRI6KjFrA9VPhi+xFJC9UZUEQ8l+05ktYvrx+tO56IQTBu8dvBy6gR0TmPSlrL9qUAZbizx2uOKSJG9mdJp1ONmvRsAWz71PpCiuhfrdzw1tbLqBHRUbsAJ0u6nWoM7hcD29QaUUSMZgHgPqoRWoYYSPEb0QGtFL+5jBrRI2xfLGkVqpFZAK6x/VSdMUXEyGzvMNZ6SXvY/spUxRPR71oZ5/dRSWsNzeQyakT3kbRh+bkVsBmwUpk2K8siovdsXXcAEf2klTO/u5DLqBHd7g3AuVSF73C5jBrRm1R3ABH9pJVxfnMZNaLL2d6z/BzzMmpE9BSPv0lENKuZcX43tH3uCJdMVyrj/OZMUkSX
kLTrWOttHzhVsURE27R85lfSjcDDwBzgaduz2h1URK9q5sxvLqNG9I5Fx1iXs0cRXUjSErbvH7bsJbZvKLMnT3DXb7R97+Sii+g/zYzzm8uoET3C9pcBJB0F7Gz7gTK/OHBAjaFFxOh+JmkT2w8BSFoVOAl4JYDt/esMLqLfNNPtIZdRI3rP6kOFL4Dtf0pas8Z4ImJ0+1MVwG+nuq/maGC7Se7TwFmSDBxi+9DhG0jaEdgRYMaMGZM8XETvaKbbQy6jRvSeaZIWt/1PqC6r0troLhExRWyfIWk+qgdHLQpsafvvk9zt+rZvk/Qi4GxJf7N9/rDjHgocCjBr1qy05zEwmun2kMuoEb3nAOAPkob6Cm4N7FdjPBExjKTvMPdJpMWA64BPlBvKPzXRfdu+rfy8W9JpwGuB88d+V8RgaOVMUC6jRvQI20dLms2/H5e6le2r64wpIp5j9rD5S9qxU0kLA9NsP1xevwXYux37jugHrRS/uYwa0UNKsZuCN6JL2T4Kni1Wn7A9p8zPA8w/iV0vBZwmCap2+jjbZ04y3Ii+0UrxmsuoERER7XcOsBHwSJlfkKr/77oT2Znt64FXtye0iP7TyhPechk1IiKi/RawPVT4YvsRSQvVGVBEP2up20Iuo0ZERLTdo5LWsn0pgKS1gcdrjimib6XPbkRERL12AU6WdDvVo4xfDGxTa0QRfSzFb0RERI1sXyxpFaoHXABcY/upOmOK6GcpfiMiImogaUPb50raatiqlco4v6fWElhEn0vxGxERUY83AOcCm42wzkCK34gOSPEbERFRA9t7lp871B1LxCBJ8RsREVEDSbuOtd72gVMVS8QgSfEbERFRj0XHWOcpiyJiwKT4jYiIqIHtLwNIOgrY2fYDZX5xqqeqRkQHTKs7gIiIiAG3+lDhC2D7n8Ca9YUT0d9S/EZERNRrWjnbC4CkJciV2YiOSXJFRETU6wDgD5JOLvNbA/vVGE9EX0vxGxERUSPbR0uaDWxYFm1l++o6Y4roZyl+IyIialaK3RS8EVMgfX4jIiIiYmCk+I2IiIiIgZHiNyIiIiIGRorfiIiIiBgYKX4jIiIiYmCk+I2IiIiIgZHiNyIiIiIGRm3Fr6SNJV0j6R+SPldXHBHRnORsRO9IvkaMrpbiV9I8wPeATYBVgW0lrVpHLBExvuRsRO9IvkaMra4zv68F/mH7etv/Ak4ANq8plogYX3I2onckXyPGUNfjjZcFbmmYvxVYZ/hGknYEdiyzj0i6Zgpia6fpwL11BzFE3/xg3SF0Uld91uypZrZaodNhtNG4OdsH+Qpd9nfUxznbVZ8z0G85OyhtbC/qvr/9LtPG//dGzde6it+m2D4UOLTuOCZK0mzbs+qOYxDks65fr+cr5O9oquRz7g79kLO9Jn/73aGubg+3Acs3zC9XlkVEd0rORvSO5GvEGOoqfi8GXi7pJZKeB7wHOL2mWCJifMnZiN6RfI0YQy3dHmw/LekTwK+AeYDDbV9VRywdlstJUyefdQclZ6PN8jl30ADlay/K334XkO26Y4iIiIiImBJ5wltEREREDIwUvxERERExMFL8RkRERMTA6OpxfiNGIumlwFZUQ/nMAf4OHGf7oVoDi4iIaCBpFaqHjvzJ9iMNyze2fWZ9kQ22nPmdApJ2qDuGfiHpU8APgAWA1wDzUxXBf5S0QX2RRcRIJM2S9BtJx0haXtLZkh6UdLGkNeuOL6JTSnv1U+CTwJWSGh8xvX89UQVktIcpIelm2zPqjqMfSLoCWMP2HEkLAb+wvYGkGcBPbacxjXE1nnWRtBhwINWXqSuB/7F9V53x9RNJFwF7Ai8Avk71+Z4i6U3AvrZfX2d8EZ1S2qvX235E0kzgFODHtr8t6c9pr+qTbg9tIuny0VYBS01lLANgXqruDvMDiwDYvlnSfLVGFb1kf2DokuMBwB3AZlTdaQ4BtqgnrL40n+1fAkj6mu1TAGyfI+mb9YYW0VHThro62L6x
XJ08RdIKVLVB1CTFb/ssBbwV+Oew5QJ+P/Xh9K0fAhdL+hPwH8DXACQtCdxfZ2DRs2bZXqO8/pakD9YZTB96QtJbgMUAS9rC9v+T9AaqL7ER/eouSWvYvgygnAHeFDgceFWtkQ24FL/t83NgkaE/8kaSzpvyaPpUuVz0a+AVwAG2/1aW3wP8Z63BRS95kaRdqb6cPl+S/O8+YLkXor0+RtXd4RmqEwT/LelI4DbgIzXGFdFpHwCeblxg+2ngA5IOqSekgPT5jYgBJGnPYYu+b/seSS8Gvm77A3XE1a8kvQJYhtzxHhFdIMVvRAykDEE0Ncod7x8H/gasAexs+6dl3aW216oxvIgYQLm8FxEDR9InyRBEU+UjVP2qtwA2AP5X0s5lXW76iYgpl+K3z0lq6WY7SRtI+nmn4onoEjsCa6cgmxJz3fFO9XlvIulA8llHj0sb25tS/PY52+vWHUNEF0pBNnXukrTG0Ez53DcFppM73qPHpY3tTSl++5ykR8rPDSSdJ+kUSX+TdKwklXUbl2WXUo1zOvTehSUdLukiSX8eujQs6duSvlRev1XS+ZLytxS9JAXZ1PkAcGfjAttPl5sKM0JL9LS0sb0pQ50NljWB1YDbgQuB9STNBg4DNgT+AZzYsP0XgHNtf0jSC4CLyjBje1CNtfs74CDgbbafmbpfI2LSMgTRFLF96xjrLpzKWCI6LG1sj8g3icFyke1bSxJdBswEVgFusH1tGef0mIbt3wJ8TtJlwHnAAsAM249R3cRyNvBd29dN2W8Q0QYlD+4cZV0KsoiYiLSxPSJnfgfLkw2v5zD+v7+Ad9q+ZoR1rwLuoxq7MyIiYtClje0ROfMbfwNmSlqxzG/bsO5XwCcb+i2tWX6uAHya6hLPJpLWmcJ4IyIiekXa2C6U4nfA2X6CatinM0pn/LsbVu8DzAdcLukqYJ+SpD8CdrN9O/Bh4IeSFpji0CP6ToZNiugvaWO7U57wFhHRoyRtQNVIblpzKBERPSNnfiMiukSGTYqI6Lzc8BYR0Z0ybFJERAfk239ERHfKsEkRER2QM78REd0pwyZFRHRAzvxGRPSODJsUETFJKX4jInpEhk2KiJi8DHUWEREREQMjZ34jIiIiYmCk+I2IiIiIgZHiNyIiIiIGRorfiIiIiBgYKX4jIiIiYmCk+I2IiIiIgZHiNyIiIiIGxv8HI/A/gGZQcY4AAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+    "# Check whether users' click environments change noticeably: randomly sample 10 users and plot each one's click-environment distribution\n",
+ "sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)\n",
+ "sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]\n",
+ "cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']\n",
+ "for _, user_df in sample_users.groupby('user_id'):\n",
+ " plot_envs(user_df, cols, 2, 3)"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Check whether users' click environments change noticeably: randomly sample 10 users and plot each one's click-environment distribution\n",
- "sample_user_ids = np.random.choice(tst_click['user_id'].unique(), size=10, replace=False)\n",
- "sample_users = user_click_merge[user_click_merge['user_id'].isin(sample_user_ids)]\n",
- "cols = ['click_environment','click_deviceGroup', 'click_os', 'click_country', 'click_region','click_referrer_type']\n",
- "for _, user_df in sample_users.groupby('user_id'):\n",
- " plot_envs(user_df, cols, 2, 3)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
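-    "The click environment is fairly stable for the vast majority of users. Idea: statistical features of these environment fields can be used to represent attributes of the user."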
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Distribution of the number of articles clicked per user"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:40:04.296033Z",
- "start_time": "2020-11-13T15:40:03.980868Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- "[]"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
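+    "The click environment is fairly stable for the vast majority of users. Idea: statistical features of these environment fields can be used to represent attributes of the user."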
]
- },
- "execution_count": 33,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAASw0lEQVR4nO3da4yc1X3H8e8fr+/GVxZj1nZsgnshVCl0RYyS8iLkBm1qKpGIqCpWimSpJU1SWjXQvEjUV0nUQEMTkTghFamilIRQYVW0gQJRlRdxsgbCNYSNa8CLsZeLL/EFbHz6Yo6dsbPjZ9be2Znn+PuRrH2e85yZ55x9xr+ZOXP2TKSUkCSV64xuN0CS1FkGvSQVzqCXpMIZ9JJUOINekgrX1+0GAJx11llpxYoV3W6GJNXKpk2bXk4p9VfV64mgX7FiBUNDQ91uhiTVSkQ81049h24kqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSpcrYP+p1te5eb7nuGNQ4e73RRJ6lm1DvqHn3uNWx8c5tBhg16SWql10EuSqhn0klQ4g16SCmfQS1Lhigh6v99cklqrddBHdLsFktT7ah30kqRqBr0kFc6gl6TCGfSSVLgigt5JN5LUWq2DPnDajSRVqXXQS5KqGfSSVDiDXpIKV0TQJ9dAkKSWah30LoEgSdVqHfSSpGoGvSQVzqCXpMIZ9JJUuCKC3jk3ktRaEUEvSWrNoJekwhn0klS4toI+Iv4mIp6MiCci4jsRMSMiVkbExogYjog7I2Jarjs97w/n4ys62gNJ0glVBn1EDAAfBwZTShcCU4BrgM8Dt6SUzgdeA67LN7kOeC2X35LrSZK6pN2hmz5gZkT0AbOAbcC7gbvy8TuAq/L2mrxPPn55RGcXK3CpG0lqrTLoU0ojwD8Bz9MI+F3AJmBnSulQrrYVGMjbA8AL+baHcv1Fx99vRKyLiKGIGBodHT2pxnf4+UOSitDO0M0CGq/SVwLnArOBD5zqiVNK61NKgymlwf7+/lO9O0lSC+0M3bwH+L+U0mhK6SBwN/BOYH4eygFYCozk7RFgGUA+Pg94ZUJbLUlqWztB/zywOiJm5bH2y4GngIeAq3OdtcA9eXtD3icffzC5YLwkdU07Y/QbaXyo+jDweL7NeuBTwA0RMUxjDP72fJPbgUW5/Abgxg60W5LUpr7qKpBS+gzwmeOKNwOXjFH3APChU2/aOPh+QZJaqvVfxjrnRpKq1TroJUnVDHpJKpxBL0mFM+glqXBFBH1y2o0ktVTroHepG0mqVuuglyRVM+glqXAGvSQVzqCXpMIVEfSujSlJrdU66J10I0nVah30kqRqBr0kFc6gl6TCFRH0fhYrSa3VOujDNRAkqVKtg16SVM2gl6TCGfSSVDiDXpIKV0TQJ9dAkKSWah30TrqRpGq1DnpJUjWDXpIKZ9BLUuEMekkqXBFB75wbSWqt1kHvpBtJqlbroJckVTPoJalwBr0kFa6toI+I+RFxV0T8PCKejohLI2JhRNwfEc/mnwty3YiIWyNiOCIei4iLO9sFSdKJtPuK/kvAf6eUfgd4O/A0cCPwQEppFfBA3ge4AliV/60DbpvQFo/BpW4kqbXKoI+IecBlwO0AKaU3Uko7gTXAHbnaHcBVeXsN8K3U8GNgfkQsmeB2H2lcR+5WkkrSziv6lcAo8K8R8UhEfCMiZgOLU0rbcp2XgMV5ewB4oen2W3OZJKkL2gn6PuBi4LaU0kXAXn49TANAaqwTPK4BlIhYFxFDETE0Ojo6nptKksahnaDfCmxNKW3M+3fRCP7tR4Zk8s8d+fgIsKzp9ktz2TFSSutTSoMppcH+/v6Tbb8kqUJl0KeUXgJeiIjfzkWXA08BG4C1uWwtcE/e3gBcm2ffrAZ2NQ3xSJImWV+b9f4a+HZETAM2Ax+l8STx3Yi4DngO+HCuey9wJTAM7Mt1Oyq52o0ktdRW0KeUHgUGxzh0
+Rh1E3D9qTWrPc65kaRq/mWsJBXOoJekwhn0klQ4g16SCldG0DvpRpJaqnXQu9SNJFWrddBLkqoZ9JJUOINekgpXRND7WawktVbroA8XQZCkSrUOeklSNYNekgpn0EtS4Qx6SSpcEUGfnHYjSS3VOuhdAkGSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQR9crUbSWqp1kHvpBtJqlbroJckVTPoJalwBr0kFc6gl6TCFRH0rnUjSa3VOuhd60aSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQS9k24kqbW2gz4ipkTEIxHxn3l/ZURsjIjhiLgzIqbl8ul5fzgfX9GhthOudiNJlcbziv4TwNNN+58HbkkpnQ+8BlyXy68DXsvlt+R6kqQuaSvoI2Ip8EfAN/J+AO8G7spV7gCuyttr8j75+OW5viSpC9p9Rf/PwN8Dh/P+ImBnSulQ3t8KDOTtAeAFgHx8V65/jIhYFxFDETE0Ojp6cq2XJFWqDPqI+GNgR0pp00SeOKW0PqU0mFIa7O/vn8i7liQ16WujzjuBP4mIK4EZwFzgS8D8iOjLr9qXAiO5/giwDNgaEX3APOCVCW95k+RiN5LUUuUr+pTSTSmlpSmlFcA1wIMppT8DHgKuztXWAvfk7Q15n3z8wdSpJHbkX5Iqnco8+k8BN0TEMI0x+Ntz+e3Aolx+A3DjqTVRknQq2hm6OSql9EPgh3l7M3DJGHUOAB+agLZJkiZAEX8ZK0lqrYig97NYSWqt1kHvZ7GSVK3WQS9JqmbQS1LhDHpJKpxBL0mFM+glqXC1DnpXP5akarUOeklSNYNekgpn0EtS4Qx6SSpcEUHvWjeS1Fqtg945N5JUrdZBL0mqZtBLUuEMekkqnEEvSYUrIugTTruRpFZqHfQudSNJ1Wod9JKkaga9JBXOoJekwhn0klS4IoLetW4kqbVaB72zbiSpWq2DXpJUzaCXpMIZ9JJUOINekgpXRNA76UaSWqt10IffMSVJlSqDPiKWRcRDEfFURDwZEZ/I5Qsj4v6IeDb/XJDLIyJujYjhiHgsIi7udCckSa2184r+EPC3KaULgNXA9RFxAXAj8EBKaRXwQN4HuAJYlf+tA26b8FZLktpWGfQppW0ppYfz9h7gaWAAWAPckavdAVyVt9cA30oNPwbmR8SSiW64JKk94xqjj4gVwEXARmBxSmlbPvQSsDhvDwAvNN1say47/r7WRcRQRAyNjo6Ot93HSK6BIEkttR30ETEH+D7wyZTS7uZjqZG040rblNL6lNJgSmmwv79/PDdtatNJ3UySTittBX1ETKUR8t9OKd2di7cfGZLJP3fk8hFgWdPNl+YySVIXtDPrJoDbgadTSjc3HdoArM3ba4F7msqvzbNvVgO7moZ4JEmTrK+NOu8E/hx4PCIezWX/AHwO+G5EXAc8B3w4H7sXuBIYBvYBH53IBkuSxqcy6FNKP4KWf5l0+Rj1E3D9KbZLkjRBav2XsUc450aSWisi6CVJrRn0klQ4g16SCmfQS1LhDHpJKlwRQe9SN5LUWq2DPlzsRpIq1TroJUnVDHpJKpxBL0mFM+glqXC1Dvq9rx8C4I1Dh7vcEknqXbUO+gWzpgJw2PmVktRSrYN+xtQpALzuK3pJaqnWQT+9rxH0Dt1IUmv1Dvqpjea//KvXu9wSSepdtQ76aVMazT/Dv5CVpJZqHfQLZ08DYMsre7vcEknqXbUO+vl51s3BNx2jl6RWah30s6Y1vtv82R2/6nJLJKl31TroARbNnsbTL+7udjMkqWfVPujPmTeDzS87Ri9JrdQ+6N+xchEAv9i+p8stkaTeVPug/+DblwDw9f/d3OWWSFJvqn3QX7R8AQDf27T16CJnkqRfq33QA/zjmrcB8MEv/6jLLZGk3lNE0F976QrOP3sOm0f3ctkXHmLPgYPdbpIk9Ywigh7g3o//IcsXzuL5V/fxe5+9jy/e94yLnUkSEKkH1nIfHBxMQ0NDE3JfN9//
C2594Nmj++/53bN5/9vO4f0XnsPcGVMn5ByS1AsiYlNKabCyXmlBD7DvjUP8y4PD3P3wVrbv/vXKlufMncFFy+dz4cA8Llo2n7cNzGPO9D6mnOGiaJLq57QO+mbbdx9gw6Mv8pMtr/LY1p3HBP8R5/XPZmD+TM6aM53zz57DnOl9rDxrNnNnTmXZgpksmjO9I22TpFNh0Ldw8M3DPDGyi0ee38nW1/azY88Bntq2m9cPHmZk5/4T3nbx3OmcM3cG0Fg5c/nCWUePLVs46+hqmkDjiePMY58gBubPPPqtWJJ0qtoN+r4OnfwDwJeAKcA3Ukqf68R5TsbUKWdw0fIFR+ffN9v/xpvsP/gmW17Zy659B9m26wDbdx8AGn95u//gmwBseXkvz726j0de2ElKsGt/+7N8zpz+m7/y/jOnc+78mWPWnzltCr+1eM4J73PZgmOfZE7kwoF59I1jqGp63xTmzfKzDanOJjzoI2IK8BXgvcBW4KcRsSGl9NREn2uizZw2hZnTprQdmkfs2neQnfvfOLo/snM/o3uOHSJ6/pV9vLbvN58Qhkd/xd7XDx19Emm2bed+dux5nYd+vqPluQ8d7vw7sgWzpk7IO5HxPCFVmTV9CqvOPnNC7msiNJ6sZ0zqOd/aP4czZ3TktZpOwYy+KZzRY5/7deJRcgkwnFLaDBAR/w6sAXo+6E/WvFlTj3nV+5ZFsyft3Dt2H2DHnva+SvGpF3dz8HD7U04PvZl4YmQXE/EFXk+M7GbX/oPjevfTyos797PHv4JWj4qA8/tP/C682ccvX8UH335uB1vUmaAfAF5o2t8KvOP4ShGxDlgHsHz58g404/Rw9twZnD23vVeSFw7M63BrJs+BMd4Bdcure9/g+Vf3Teo5n3tlLzvHeIeo7npx535Gx/kd1vNmdn5otGvv+1JK64H10PgwtlvtUD310ofa586f2fIzlk5Zfd6iST2f6q0Tfxk7Aixr2l+ayyRJXdCJoP8psCoiVkbENOAaYEMHziNJasOED92klA5FxMeAH9CYXvnNlNKTE30eSVJ7OjJGn1K6F7i3E/ctSRqfYlavlCSNzaCXpMIZ9JJUOINekgrXE6tXRsQo8NxJ3vws4OUJbE4d2OfTg30+PZxKn9+SUuqvqtQTQX8qImKonWU6S2KfTw/2+fQwGX126EaSCmfQS1LhSgj69d1uQBfY59ODfT49dLzPtR+jlySdWAmv6CVJJ2DQS1Lhah30EfGBiHgmIoYj4sZut2e8ImJLRDweEY9GxFAuWxgR90fEs/nnglweEXFr7utjEXFx0/2szfWfjYi1TeV/kO9/ON920r/IMiK+GRE7IuKJprKO97HVObrY589GxEi+1o9GxJVNx27K7X8mIt7fVD7m4zsvAb4xl9+ZlwMnIqbn/eF8fMUkdZmIWBYRD0XEUxHxZER8IpcXe61P0Ofeu9YppVr+o7EE8i+B84BpwM+AC7rdrnH2YQtw1nFlXwBuzNs3Ap/P21cC/wUEsBrYmMsXApvzzwV5e0E+9pNcN/Jtr+hCHy8DLgaemMw+tjpHF/v8WeDvxqh7QX7sTgdW5sf0lBM9voHvAtfk7a8Cf5m3/wr4at6+BrhzEvu8BLg4b58J/CL3rdhrfYI+99y1ntT/9BP8S74U+EHT/k3ATd1u1zj7sIXfDPpngCVND6Rn8vbXgI8cXw/4CPC1pvKv5bIlwM+byo+pN8n9XMGxodfxPrY6Rxf73Oo//zGPWxrf43Bpq8d3DrmXgb5cfrTekdvm7b5cL7p0ze8B3ns6XOsx+txz17rOQzdjfQn5QJfacrIScF9EbIrGl6UDLE4pbcvbLwGL83ar/p6ofOsY5b1gMvrY6hzd9LE8TPHNpuGF8fZ5EbAzpXTouPJj7isf35XrT6o8jHARsJHT5Fof12fosWtd56AvwbtSShcDVwDXR8RlzQdT4+m66Pmvk9HHHvk93ga8Ffh9YBvwxa62pkMiYg7wfeCTKaXd
zcdKvdZj9LnnrnWdg772X0KeUhrJP3cA/wFcAmyPiCUA+eeOXL1Vf09UvnSM8l4wGX1sdY6uSCltTym9mVI6DHydxrWG8ff5FWB+RPQdV37MfeXj83L9SRERU2kE3rdTSnfn4qKv9Vh97sVrXeegr/WXkEfE7Ig488g28D7gCRp9ODLTYC2NcT9y+bV5tsJqYFd+u/oD4H0RsSC/RXwfjXG8bcDuiFidZydc23Rf3TYZfWx1jq44EkTZn9K41tBo5zV5FsVKYBWNDx3HfHznV6wPAVfn2x//+zvS56uBB3P9jsu//9uBp1NKNzcdKvZat+pzT17rbnxoMYEfflxJ45PuXwKf7nZ7xtn282h8uv4z4Mkj7acxzvYA8CzwP8DCXB7AV3JfHwcGm+7rL4Dh/O+jTeWD+UH2S+DLdOGDOeA7NN6+HqQxxnjdZPSx1Tm62Od/y316LP8nXdJU/9O5/c/QNDOq1eM7P3Z+kn8X3wOm5/IZeX84Hz9vEvv8LhpDJo8Bj+Z/V5Z8rU/Q55671i6BIEmFq/PQjSSpDQa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKtz/A1/NmoIeUlAfAAAAAElFTkSuQmCC\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Distribution of the number of articles clicked per user"
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)\n",
- "plt.plot(user_click_item_count)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "The number of articles a user has clicked indicates how active that user is."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 34,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Top 50 users by click count\n",
- "plt.plot(user_click_item_count[:50])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
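-    "The top 50 users by click count each have more than 100 clicks. Idea: we can define users with at least 100 clicks as active users. This is a simple heuristic; a more complete measure of activity also takes click time into account, and later we judge user activity from both click count and click time."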
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 35,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXEAAAD4CAYAAAAaT9YAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAARV0lEQVR4nO3dfYxc1X3G8eexd7ExEDAYjEPYrkOQFZekKUxT2lKgJQHHSuWGphJIDaRYWaUBKUitKJQqRWlTNYnaSFWippvaMonASZsUGSVtg4tSXKkYYqd+WQqYlwLxSzAvcYgIBYxP/5i7u6Nl786dmTt7z5n7/UjWzt6Z3fmdnfGjM+ece65DCAIApGlB1QUAALpHiANAwghxAEgYIQ4ACSPEASBhQ/P5ZMuWLQujo6Pz+ZQAkLydO3c+H0I4fbb75jXER0dHtWPHjvl8SgBInu2n8+5jOAUAEkaIA0DCCHEASBghDgAJI8QBIGFtQ9z2RtuHbU/Mct8f2g62l/WnPADAXIr0xDdJWjPzoO2zJV0u6ZmSawIAFNR2nXgIYZvt0Vnu+oKkmyRtKbuome59+Fnt/uGRfj/Nm5yy5Dh99FdHtWCB5/25AaCIrk72sb1O0oEQwm577oCzPSZpTJJGRka6eTrdt+85fW177lr3vpjcZv2SVafrnNNPnNfnBoCiOg5x20sk/YmaQylthRDGJY1LUqPR6OoKFJ9ed54+ve68bn60a9/ec1A33PnfeuMYF80AEK9uVqecI2mlpN22n5L0Nkk/sH1mmYXFggsfAYhZxz3xEMJeSWdMfp8FeSOE8HyJdVXOYhwcQPyKLDHcLOl+Sats77e9vv9lVW9yqD+IrjiAeBVZnXJ1m/tHS6smIpP9cIZTAMSMMzZzTPXECXEAESPE22A4BUDMCPFcTGwCiB8hnoPhFAApIMRz0A8HkAJCvA164gBiRojnaLcnDADEgBDPMbVOnNUpACJGiOdgYhNACgjxHNOn3QNAvAhxAEgYIZ5jchfDwHgKgIgR4nkYTgGQAEI8B7sYAkgBIZ5jep04KQ4gXoQ4ACSMEM/BcAqAFBDiOVgnDiAFhHiO6SWGFRcCAHMgxHOw/xWAFBDibXCyD4CYEeI5WGAIIAWEeB52MQSQAEI8x9TEJn1xABEjxHMwsQkgBYR4O3TEAUSMEM/BxCaAFBDiOSY3wGJiE0DMCPEcjIkDSAEhnoOr3QNIASHeBsMpAGLWNsRtb7R92PZEy7E/t73H9i7b99h+a3/LnH/sYgggBUV64pskrZlx7PMhhHeHEN4j6duSPlVyXRFgUBxA/IbaPSCEsM326IxjL7V8e4IGsMM6tKAZ4us3fV8LIp/lPGHRQm25/iKNnLak6lIAzLO2IZ7H9mckXSPpJ5J+Y47HjUkak6SRkZFun27erX7rW3TTmlX66f8drbqUOR088oq27DqoA0deIcSBGuo6xEMIt0q61fYtkm6Q9Gc5jxuXNC5JjUYjmR778MIF+sSl76i6jLa2P/mCtuw6yCoaoKbKWJ1yh6TfKeH3oAtTAz1kOFBLXYW47XNbvl0n6ZFyykG3yHCgntoOp9jeLOlSScts71dz2GSt7VWSjkl6WtLH+1kk8rE9AFBvRVanXD3L4Q19qAVdiHzhDIA+44zNxLE9AFBvhHjizGXkgFojxAcEGQ7UEyGevMmJTWIcqCNCPHFMbAL1RognjsvIAfVGiCfO7JkL1BohPiBYYgjUEyGeuKnhFDIcqCVCPHFMbAL1RognzmLvFKDOCPHEMa8J1BshPiA42QeoJ0J8QBDhQD0R4oljYhOoN0I8cUxsAvVGiCfOXGQTqDVCfEDQEwfqiRBPHEsMgXojxBNnMbMJ1BkhnjguzwbUGyGeOC6UDNQbIT4g6IkD9USIJ46JTaDeCPHkMbEJ1BkhnrjpiU364kAdEeKJox8O1BshPiDoiAP1RIgnbvJq9ywxBOqJEE8cwylAvbUNcdsb
bR+2PdFy7PO2H7G9x/Zdtk/pa5XIxRmbQL0V6YlvkrRmxrGtks4LIbxb0j5Jt5RcFwpiP3Gg3obaPSCEsM326Ixj97R8u13Sh0uuCx26b99zOvLK61WX0bMVJy/W2netqLoMIBltQ7yA6yR9I+9O22OSxiRpZGSkhKdDq5OXDOvERUO6e/dB3b37YNXllGLvbZfrpMXDVZcBJKGnELd9q6Sjku7Ie0wIYVzSuCQ1Gg0+9Jfs5OOHteNP36dXjx6rupSe3fnAM/rsvz2io2/wNgGK6jrEbX9U0gclXRY4XbBSi4cXavHwwqrL6Nnxw80pGt5MQHFdhbjtNZJuknRJCOFn5ZaEurJZMAl0qsgSw82S7pe0yvZ+2+slfVHSSZK22t5l+8t9rhM1wD4wQOeKrE65epbDG/pQC2pu+gIXAIrijE1Eh444UBwhjniwDwzQMUIc0WBaE+gcIY5omEFxoGOEOKJDhgPFEeKIBpt5AZ0jxBGNqXXi9MWBwghxRIOJTaBzhDiiwQUugM4R4ogOGQ4UR4gjGtMTm8Q4UBQhjngwnAJ0jBBHNJjYBDpHiANAwghxRGPyohAMpwDFEeKIxvTWKaQ4UBQhjmiwThzoHCEOAAkjxBGN6b1TABRFiCManOwDdI4QRzToiQOdI8QRHTriQHGEOAAkjBBHNMxFNoGOEeKIxlSEk+FAYYQ4osHEJtA5QhzRoScOFEeIIxpmM1qgY4Q4osHV7oHOEeKIBhObQOcIcUSDXQyBzrUNcdsbbR+2PdFy7HdtP2T7mO1Gf0tE3TCcAhRXpCe+SdKaGccmJF0paVvZBaHOmNgEOjXU7gEhhG22R2cce1hqPcMO6N2C7O30e//wgIYWDv5I37lnnKg7P3Zh1WUgcW1DvFe2xySNSdLIyEi/nw4J++WVp+m6X1upV15/o+pS+m7vgSP6rydeqLoMDIC+h3gIYVzSuCQ1Gg0GO5Hr5CXD+tRvra66jHnxha37NHHgparLwAAY/M+sQISmV+LQr0FvCHGgQmQ4elVkieFmSfdLWmV7v+31tj9ke7+kX5H0Hdvf7XehwCCZuhRdxXUgfUVWp1ydc9ddJdcC1AYLu1AWhlOACkxvMUBfHL0hxIEKsHc6ykKIAxWiI45eEeJABSbPdmafGPSKEAeAhBHiQAXYdhdlIcSBCnApOpSFEAcqRE8cvSLEgQpwPVGUhRAHKsBgCspCiAMVYGITZSHEgQqwARbKQogDFWLvFPSKEAcqwN4pKAshDgAJI8SBCkztnUJXHD0ixIEKTC0xJMTRI0IcqBAn+6BXhDhQAdaJoyyEOFABzthEWQhxoALTF4UAekOIAxWYHk4hxtEbQhyoEBGOXhHiQAUmx8TpiKNXhDhQBTO1iXIQ4kAFpnriDKigR4Q4UAFPpzjQE0IcqBAZjl4R4kAFpi4KQYqjR4Q4UAHmNVGWtiFue6Ptw7YnWo6danur7ceyr0v7WyYwWJjYRFmK9MQ3SVoz49jNku4NIZwr6d7sewAFsQEWyjLU7gEhhG22R2ccXifp0uz27ZL+Q9Ifl1kYUAf/sveQli45ruoyorRoeIHev3q5Fg0trLqUqLUN8RzLQwiHsts/krQ874G2xySNSdLIyEiXTwcMlmUnLpIk/cV3Hq64krj9/Ucu0BU/f2bVZUSt2xCfEkIItnM/FIYQxiWNS1Kj0eDDIyDpsncu1/ZbLtNrR49VXUqUnn7xZX1kw4N6lb9PW92G+LO2V4QQDtleIelwmUUBdXDmyYurLiFar73RDG92eWyv2yWGd0u6Nrt9raQt5ZQDACzB7ESRJYabJd0vaZXt/bbXS/orSe+3/Zik92XfA0ApyPDiiqxOuTrnrstKrgUAJLVc+YjRlLY4YxNAtDgZqj1CHEB0uGhGcYQ4gOgwsVkcIQ4gOuzyWBwhDiA6U3vLVFtGEghxANHiZJ/2CHEA0SLC2yPEAUSHic3iCHEA0TGD4oUR4gCiw5WPiiPEAUSLec32
CHEA0WE0pThCHEB0zD6GhRHiAKLDhaSLI8QBRIeJzeIIcQDRoifeHiEOID5MbBZGiAOIDhObxRHiAKJjrgpRGCEOIDrTE5tohxAHEC064u0R4gCiM321e1K8HUIcQHSY1iyOEAcQHfZOKY4QBxAdLpRcHCEOIFpkeHuEOID4TG2ARYy3Q4gDiA7X2CyOEAcQHTK8OEIcQHSm14lXXEgCCHEA0WI/8fZ6CnHbn7Q9Yfsh2zeWVBOAmmP/q+K6DnHb50n6mKT3SvoFSR+0/Y6yCgNQX0xsFjfUw8++U9IDIYSfSZLt+yRdKelzZRQGoL4mT/b5yn8+qW/u3F9xNeX4yyvfpV8aPbX039tLiE9I+ozt0yS9ImmtpB0zH2R7TNKYJI2MjPTwdADqYvHwAn38knP0zIsvV11KaY4fXtiX3+teFtPbXi/pE5JelvSQpFdDCDfmPb7RaIQdO96U8wCAOdjeGUJozHZfTxObIYQNIYQLQggXS/qxpH29/D4AQGd6GU6R7TNCCIdtj6g5Hn5hOWUBAIroKcQlfSsbE39d0vUhhCO9lwQAKKqnEA8h/HpZhQAAOscZmwCQMEIcABJGiANAwghxAEhYTyf7dPxk9nOSnu7yx5dJer7EclJAm+uBNtdDL23+uRDC6bPdMa8h3gvbO/LOWBpUtLkeaHM99KvNDKcAQMIIcQBIWEohPl51ARWgzfVAm+uhL21OZkwcAPBmKfXEAQAzEOIAkLAkQtz2GtuP2n7c9s1V19ML20/Z3mt7l+0d2bFTbW+1/Vj2dWl23Lb/Nmv3Htvnt/yea7PHP2b72qraMxvbG20ftj3Rcqy0Ntq+IPsbPp79bOVXZMxp8222D2Sv9S7ba1vuuyWr/1HbV7Qcn/W9bnul7Qey49+wfdz8tW52ts+2/T3b/5NdLP2T2fGBfa3naHN1r3UIIep/khZKekLS2yUdJ2m3pNVV19VDe56StGzGsc9Jujm7fbOkz2a310r6VzUv/n2hmtc0laRTJT2ZfV2a3V5addta2nOxpPMlTfSjjZIezB7r7Gc/EGmbb5P0R7M8dnX2Pl4kaWX2/l4413td0j9Kuiq7/WVJfxBBm1dIOj+7fZKaF4VZPciv9Rxtruy1TqEn/l5Jj4cQngwhvCbp65LWVVxT2dZJuj27fbuk3245/tXQtF3SKbZXSLpC0tYQwoshhB9L2ippzTzXnCuEsE3SizMOl9LG7L63hBC2h+a7/Kstv6syOW3Os07S10MIr4YQ/lfS42q+z2d9r2e9z9+U9M3s51v/fpUJIRwKIfwgu/1TSQ9LOksD/FrP0eY8fX+tUwjxsyT9sOX7/Zr7jxa7IOke2zvdvIi0JC0PIRzKbv9I0vLsdl7bU/yblNXGs7LbM4/H6oZs6GDj5LCCOm/zaZKOhBCOzjgeDdujkn5R0gOqyWs9o81SRa91CiE+aC4KIZwv6QOSrrd9ceudWY9joNd91qGNmb+TdI6k90g6JOmvK62mT2yfKOlbkm4MIbzUet+gvtaztLmy1zqFED8g6eyW79+WHUtSCOFA9vWwpLvU/Fj1bPbRUdnXw9nD89qe4t+krDYeyG7PPB6dEMKzIYQ3QgjHJH1Fzdda6rzNL6g59DA043jlbA+rGWZ3hBD+OTs80K/1bG2u8rVOIcS/L+ncbMb2OElXSbq74pq6YvsE2ydN3pZ0uaQJNdszOSN/raQt2e27JV2TzepfKOkn2cfU70q63PbS7GPb5dmxmJXSxuy+l2xfmI0fXtPyu6IyGWSZD6n5WkvNNl9le5HtlZLOVXMCb9b3etab/Z6kD2c/3/r3q0z2998g6eEQwt+03DWwr3Vemyt9rauc6S36T81Z7X1qzubeWnU9PbTj7WrOQu+W9NBkW9QcB7tX0mOS/l3SqdlxS/pS1u69khotv+s6NSdJHpf0+1W3bUY7N6v5kfJ1Ncf01pfZRkmN7D/JE5K+qOzM4wjb/LWsTXuy/8wrWh5/a1b/o2pZ
cZH3Xs/eOw9mf4t/krQogjZfpOZQyR5Ju7J/awf5tZ6jzZW91px2DwAJS2E4BQCQgxAHgIQR4gCQMEIcABJGiANAwghxAEgYIQ4ACft/AbwTsfQSxAYAAAAASUVORK5CYII=\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Users ranked in [25000:50000] by click count\n",
- "plt.plot(user_click_item_count[25000:50000])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "A very large number of users have two or fewer clicks; these users can be treated as inactive."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### Analysis of article click counts"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:42:14.526476Z",
- "start_time": "2020-11-13T15:42:14.463642Z"
- }
- },
- "outputs": [],
- "source": [
- "item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T15:42:16.198000Z",
- "start_time": "2020-11-13T15:42:16.044455Z"
- }
- },
- "outputs": [
+ },
{
- "data": {
- "text/plain": [
- "[]"
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:40:04.296033Z",
+ "start_time": "2020-11-13T15:40:03.980868Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 33,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXoAAAD4CAYAAADiry33AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAASw0lEQVR4nO3da4yc1X3H8e8fr+/GVxZj1nZsgnshVCl0RYyS8iLkBm1qKpGIqCpWimSpJU1SWjXQvEjUV0nUQEMTkTghFamilIRQYVW0gQJRlRdxsgbCNYSNa8CLsZeLL/EFbHz6Yo6dsbPjZ9be2Znn+PuRrH2e85yZ55x9xr+ZOXP2TKSUkCSV64xuN0CS1FkGvSQVzqCXpMIZ9JJUOINekgrX1+0GAJx11llpxYoV3W6GJNXKpk2bXk4p9VfV64mgX7FiBUNDQ91uhiTVSkQ81049h24kqXAGvSQVzqCXpMIZ9JJUOINekgpn0EtS4Qx6SSpcrYP+p1te5eb7nuGNQ4e73RRJ6lm1DvqHn3uNWx8c5tBhg16SWql10EuSqhn0klQ4g16SCmfQS1Lhigh6v99cklqrddBHdLsFktT7ah30kqRqBr0kFc6gl6TCGfSSVLgigt5JN5LUWq2DPnDajSRVqXXQS5KqGfSSVDiDXpIKV0TQJ9dAkKSWah30LoEgSdVqHfSSpGoGvSQVzqCXpMIZ9JJUuCKC3jk3ktRaEUEvSWrNoJekwhn0klS4toI+Iv4mIp6MiCci4jsRMSMiVkbExogYjog7I2Jarjs97w/n4ys62gNJ0glVBn1EDAAfBwZTShcCU4BrgM8Dt6SUzgdeA67LN7kOeC2X35LrSZK6pN2hmz5gZkT0AbOAbcC7gbvy8TuAq/L2mrxPPn55RGcXK3CpG0lqrTLoU0ojwD8Bz9MI+F3AJmBnSulQrrYVGMjbA8AL+baHcv1Fx99vRKyLiKGIGBodHT2pxnf4+UOSitDO0M0CGq/SVwLnArOBD5zqiVNK61NKgymlwf7+/lO9O0lSC+0M3bwH+L+U0mhK6SBwN/BOYH4eygFYCozk7RFgGUA+Pg94ZUJbLUlqWztB/zywOiJm5bH2y4GngIeAq3OdtcA9eXtD3icffzC5YLwkdU07Y/QbaXyo+jDweL7NeuBTwA0RMUxjDP72fJPbgUW5/Abgxg60W5LUpr7qKpBS+gzwmeOKNwOXjFH3APChU2/aOPh+QZJaqvVfxjrnRpKq1TroJUnVDHpJKpxBL0mFM+glqXBFBH1y2o0ktVTroHepG0mqVuuglyRVM+glqXAGvSQVzqCXpMIVEfSujSlJrdU66J10I0nVah30kqRqBr0kFc6gl6TCFRH0fhYrSa3VOujDNRAkqVKtg16SVM2gl6TCGfSSVDiDXpIKV0TQJ9dAkKSWah30TrqRpGq1DnpJUjWDXpIKZ9BLUuEMekkqXBFB75wbSWqt1kHvpBtJqlbroJckVTPoJalwBr0kFa6toI+I+RFxV0T8PCKejohLI2JhRNwfEc/mnwty3YiIWyNiOCIei4iLO9sFSdKJtPuK/kvAf6eUfgd4O/A0cCPwQEppFfBA3ge4AliV/60DbpvQFo/BpW4kqbXKoI+IecBlwO0AKaU3Uko7gTXAHbnaHcBVeXsN8K3U8GNgfkQsmeB2H2lcR+5WkkrSziv6lcAo8K8R8UhEfCMiZgOLU0rbcp2XgMV5ewB4oen2W3OZJKkL2gn6PuBi4LaU0kXAXn49TANAaqwTPK4BlIhYFxFDETE0Ojo6nptKksahnaDfCmxNKW3M+3fRCP7tR4Zk8s8d+fgIsKzp9ktz2TFSSutTSoMppcH+/v6Tbb8kqUJl0KeUXgJeiIjfzkWXA08BG4C1uWwtcE/e3gBcm2ffrAZ2NQ3xSJImWV+b9f4a+HZETAM2Ax+l8STx3Yi4DngO+HCuey9wJTAM7Mt1Oyq52o0ktdRW0KeUHgUGxzh0
+Rh1E3D9qTWrPc65kaRq/mWsJBXOoJekwhn0klQ4g16SCldG0DvpRpJaqnXQu9SNJFWrddBLkqoZ9JJUOINekgpXRND7WawktVbroA8XQZCkSrUOeklSNYNekgpn0EtS4Qx6SSpcEUGfnHYjSS3VOuhdAkGSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQR9crUbSWqp1kHvpBtJqlbroJckVTPoJalwBr0kFc6gl6TCFRH0rnUjSa3VOuhd60aSqtU66CVJ1Qx6SSqcQS9JhTPoJalwRQS9k24kqbW2gz4ipkTEIxHxn3l/ZURsjIjhiLgzIqbl8ul5fzgfX9GhthOudiNJlcbziv4TwNNN+58HbkkpnQ+8BlyXy68DXsvlt+R6kqQuaSvoI2Ip8EfAN/J+AO8G7spV7gCuyttr8j75+OW5viSpC9p9Rf/PwN8Dh/P+ImBnSulQ3t8KDOTtAeAFgHx8V65/jIhYFxFDETE0Ojp6cq2XJFWqDPqI+GNgR0pp00SeOKW0PqU0mFIa7O/vn8i7liQ16WujzjuBP4mIK4EZwFzgS8D8iOjLr9qXAiO5/giwDNgaEX3APOCVCW95k+RiN5LUUuUr+pTSTSmlpSmlFcA1wIMppT8DHgKuztXWAvfk7Q15n3z8wdSpJHbkX5Iqnco8+k8BN0TEMI0x+Ntz+e3Aolx+A3DjqTVRknQq2hm6OSql9EPgh3l7M3DJGHUOAB+agLZJkiZAEX8ZK0lqrYig97NYSWqt1kHvZ7GSVK3WQS9JqmbQS1LhDHpJKpxBL0mFM+glqXC1DnpXP5akarUOeklSNYNekgpn0EtS4Qx6SSpcEUHvWjeS1Fqtg945N5JUrdZBL0mqZtBLUuEMekkqnEEvSYUrIugTTruRpFZqHfQudSNJ1Wod9JKkaga9JBXOoJekwhn0klS4IoLetW4kqbVaB72zbiSpWq2DXpJUzaCXpMIZ9JJUOINekgpXRNA76UaSWqt10IffMSVJlSqDPiKWRcRDEfFURDwZEZ/I5Qsj4v6IeDb/XJDLIyJujYjhiHgsIi7udCckSa2184r+EPC3KaULgNXA9RFxAXAj8EBKaRXwQN4HuAJYlf+tA26b8FZLktpWGfQppW0ppYfz9h7gaWAAWAPckavdAVyVt9cA30oNPwbmR8SSiW64JKk94xqjj4gVwEXARmBxSmlbPvQSsDhvDwAvNN1say47/r7WRcRQRAyNjo6Ot93HSK6BIEkttR30ETEH+D7wyZTS7uZjqZG040rblNL6lNJgSmmwv79/PDdtatNJ3UySTittBX1ETKUR8t9OKd2di7cfGZLJP3fk8hFgWdPNl+YySVIXtDPrJoDbgadTSjc3HdoArM3ba4F7msqvzbNvVgO7moZ4JEmTrK+NOu8E/hx4PCIezWX/AHwO+G5EXAc8B3w4H7sXuBIYBvYBH53IBkuSxqcy6FNKP4KWf5l0+Rj1E3D9KbZLkjRBav2XsUc450aSWisi6CVJrRn0klQ4g16SCmfQS1LhDHpJKlwRQe9SN5LUWq2DPlzsRpIq1TroJUnVDHpJKpxBL0mFM+glqXC1Dvq9rx8C4I1Dh7vcEknqXbUO+gWzpgJw2PmVktRSrYN+xtQpALzuK3pJaqnWQT+9rxH0Dt1IUmv1Dvqpjea//KvXu9wSSepdtQ76aVMazT/Dv5CVpJZqHfQLZ08DYMsre7vcEknqXbUO+vl51s3BNx2jl6RWah30s6Y1vtv82R2/6nJLJKl31TroARbNnsbTL+7udjMkqWfVPujPmTeDzS87Ri9JrdQ+6N+xchEAv9i+p8stkaTeVPug/+DblwDw9f/d3OWWSFJvqn3QX7R8AQDf27T16CJnkqRfq33QA/zjmrcB8MEv/6jLLZGk3lNE0F976QrOP3sOm0f3ctkXHmLPgYPdbpIk9Ywigh7g3o//IcsXzuL5V/fxe5+9jy/e94yLnUkSEKkH1nIfHBxMQ0NDE3JfN9//
C2594Nmj++/53bN5/9vO4f0XnsPcGVMn5ByS1AsiYlNKabCyXmlBD7DvjUP8y4PD3P3wVrbv/vXKlufMncFFy+dz4cA8Llo2n7cNzGPO9D6mnOGiaJLq57QO+mbbdx9gw6Mv8pMtr/LY1p3HBP8R5/XPZmD+TM6aM53zz57DnOl9rDxrNnNnTmXZgpksmjO9I22TpFNh0Ldw8M3DPDGyi0ee38nW1/azY88Bntq2m9cPHmZk5/4T3nbx3OmcM3cG0Fg5c/nCWUePLVs46+hqmkDjiePMY58gBubPPPqtWJJ0qtoN+r4OnfwDwJeAKcA3Ukqf68R5TsbUKWdw0fIFR+ffN9v/xpvsP/gmW17Zy659B9m26wDbdx8AGn95u//gmwBseXkvz726j0de2ElKsGt/+7N8zpz+m7/y/jOnc+78mWPWnzltCr+1eM4J73PZgmOfZE7kwoF59I1jqGp63xTmzfKzDanOJjzoI2IK8BXgvcBW4KcRsSGl9NREn2uizZw2hZnTprQdmkfs2neQnfvfOLo/snM/o3uOHSJ6/pV9vLbvN58Qhkd/xd7XDx19Emm2bed+dux5nYd+vqPluQ8d7vw7sgWzpk7IO5HxPCFVmTV9CqvOPnNC7msiNJ6sZ0zqOd/aP4czZ3TktZpOwYy+KZzRY5/7deJRcgkwnFLaDBAR/w6sAXo+6E/WvFlTj3nV+5ZFsyft3Dt2H2DHnva+SvGpF3dz8HD7U04PvZl4YmQXE/EFXk+M7GbX/oPjevfTyos797PHv4JWj4qA8/tP/C682ccvX8UH335uB1vUmaAfAF5o2t8KvOP4ShGxDlgHsHz58g404/Rw9twZnD23vVeSFw7M63BrJs+BMd4Bdcure9/g+Vf3Teo5n3tlLzvHeIeo7npx535Gx/kd1vNmdn5otGvv+1JK64H10PgwtlvtUD310ofa586f2fIzlk5Zfd6iST2f6q0Tfxk7Aixr2l+ayyRJXdCJoP8psCoiVkbENOAaYEMHziNJasOED92klA5FxMeAH9CYXvnNlNKTE30eSVJ7OjJGn1K6F7i3E/ctSRqfYlavlCSNzaCXpMIZ9JJUOINekgrXE6tXRsQo8NxJ3vws4OUJbE4d2OfTg30+PZxKn9+SUuqvqtQTQX8qImKonWU6S2KfTw/2+fQwGX126EaSCmfQS1LhSgj69d1uQBfY59ODfT49dLzPtR+jlySdWAmv6CVJJ2DQS1Lhah30EfGBiHgmIoYj4sZut2e8ImJLRDweEY9GxFAuWxgR90fEs/nnglweEXFr7utjEXFx0/2szfWfjYi1TeV/kO9/ON920r/IMiK+GRE7IuKJprKO97HVObrY589GxEi+1o9GxJVNx27K7X8mIt7fVD7m4zsvAb4xl9+ZlwMnIqbn/eF8fMUkdZmIWBYRD0XEUxHxZER8IpcXe61P0Ofeu9YppVr+o7EE8i+B84BpwM+AC7rdrnH2YQtw1nFlXwBuzNs3Ap/P21cC/wUEsBrYmMsXApvzzwV5e0E+9pNcN/Jtr+hCHy8DLgaemMw+tjpHF/v8WeDvxqh7QX7sTgdW5sf0lBM9voHvAtfk7a8Cf5m3/wr4at6+BrhzEvu8BLg4b58J/CL3rdhrfYI+99y1ntT/9BP8S74U+EHT/k3ATd1u1zj7sIXfDPpngCVND6Rn8vbXgI8cXw/4CPC1pvKv5bIlwM+byo+pN8n9XMGxodfxPrY6Rxf73Oo//zGPWxrf43Bpq8d3DrmXgb5cfrTekdvm7b5cL7p0ze8B3ns6XOsx+txz17rOQzdjfQn5QJfacrIScF9EbIrGl6UDLE4pbcvbLwGL83ar/p6ofOsY5b1gMvrY6hzd9LE8TPHNpuGF8fZ5EbAzpXTouPJj7isf35XrT6o8jHARsJHT5Fof12fosWtd56AvwbtSShcDVwDXR8RlzQdT4+m66Pmvk9HHHvk93ga8Ffh9YBvwxa62pkMiYg7wfeCTKaXd
zcdKvdZj9LnnrnWdg772X0KeUhrJP3cA/wFcAmyPiCUA+eeOXL1Vf09UvnSM8l4wGX1sdY6uSCltTym9mVI6DHydxrWG8ff5FWB+RPQdV37MfeXj83L9SRERU2kE3rdTSnfn4qKv9Vh97sVrXeegr/WXkEfE7Ig488g28D7gCRp9ODLTYC2NcT9y+bV5tsJqYFd+u/oD4H0RsSC/RXwfjXG8bcDuiFidZydc23Rf3TYZfWx1jq44EkTZn9K41tBo5zV5FsVKYBWNDx3HfHznV6wPAVfn2x//+zvS56uBB3P9jsu//9uBp1NKNzcdKvZat+pzT17rbnxoMYEfflxJ45PuXwKf7nZ7xtn282h8uv4z4Mkj7acxzvYA8CzwP8DCXB7AV3JfHwcGm+7rL4Dh/O+jTeWD+UH2S+DLdOGDOeA7NN6+HqQxxnjdZPSx1Tm62Od/y316LP8nXdJU/9O5/c/QNDOq1eM7P3Z+kn8X3wOm5/IZeX84Hz9vEvv8LhpDJo8Bj+Z/V5Z8rU/Q55671i6BIEmFq/PQjSSpDQa9JBXOoJekwhn0klQ4g16SCmfQS1LhDHpJKtz/A1/NmoIeUlAfAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "user_click_item_count = sorted(user_click_merge.groupby('user_id')['click_article_id'].count(), reverse=True)\n",
+ "plt.plot(user_click_item_count)"
]
- },
- "execution_count": 37,
- "metadata": {},
- "output_type": "execute_result"
},
{
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "The number of articles each user has clicked indicates how active that user is."
]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 38,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[:100])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "The 100 most-clicked news articles each have more than 1,000 clicks."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 39,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYMAAAD4CAYAAAAO9oqkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAApy0lEQVR4nO3deXxU5dn/8c812YBAWJKwB8Mui7INm4q7bFpR64K1D6go9Vd3q60+9lV92tqntrW2tmqLdcEWAbW1UkERcaEuLAEBWSXsYQmRsAfIdv/+mMPTMSYkZDszk+/79ZpXzlznPnOuczLJNefc95ljzjlERKRhC/idgIiI+E/FQEREVAxERETFQEREUDEQEREg3u8EqistLc1lZmb6nYaISFRZunTpV8659LLxqC0GmZmZZGVl+Z2GiEhUMbOt5cV1mkhERFQMRERExUBERKhCMTCzF8xsj5mtKhO/08zWmdlqM/tVWPwhM8s2s/VmNiosPtqLZZvZg2Hxzma2yIvPNLPE2to4ERGpmqocGbwEjA4PmNkFwDign3OuD/AbL94bGA/08ZZ5xszizCwOeBoYA/QGrvfaAjwOPOmc6wbsAybVdKNEROTUVFoMnHMLgPwy4f8H/NI5d9xrs8eLjwNmOOeOO+c2A9nAEO+R7Zzb5JwrBGYA48zMgAuB173lpwJX1GyTRETkVFW3z6AHMMI7vfORmQ324h2A7WHtcrxYRfFUYL9zrrhMXERE6lF1i0E80AoYBjwAvOp9yq9TZjbZzLLMLCsvL69arzH10y28s2oXJaX66m4RkROqe9FZDvAPF7oZwmIzKwXSgB1ARli7jl6MCuJ7gRZmFu8dHYS3/wbn3BRgCkAwGDzl/+alpY7pi7exbvchMlo15qazOnPt4AyaJkXttXciIrWiukcG/wQuADCzHkAi8BUwCxhvZklm1hnoDiwGlgDdvZFDiYQ6mWd5xeQD4GrvdScCb1Yzp0oFAsbsu0bwp+8OpE2zRvz0rTUM/8V8fjFnLTv2H62r1YqIRDyr7E5nZjYdOJ/QJ/9c4BHgr8ALQH+gELjfOfe+1/5h4GagGLjHOfe2Fx8L/A6IA15wzj3mxbsQ6lBuBXwOfPdEx/TJBINBV9Ovo1i+fT/Pf7yZOV/sAmBM37bcMqIL/TNa1Oh1RUQilZktdc4FvxGP1tte1kYxOGHH/qNM/XQL0xdt49DxYgad1pJbzunMyD5tiQvUeVeIiEi9UTGogsPHi3ktazsvfLKZ7flHyWjVmBvP6sy1wY40a5RQq+sSEfGDisEpKCl1zFuTy/Mfb2LJln00S4pn/JAMJp6VSceWTepknSIi9UHFoJrK9iuM7tuWW9WvICJRSsWghnZ6/QqvLN7GoWPFDO+Syv87vysjuqdRD5dYiIjUChWDWnL4eDEzFm/juX9vIvfgcfq0T+G287oypm9b4uP0JbAiEtlUDGrZ8eIS3vx8J39asJFNeUfo1KoJk8/twtWDOtIoIc63vERETkbFoI6UljreXZPLsx9tZMX2/aQ1TeKmszP57rDTaN5YI5BEJLKoGNQx5xwLN+Xz7EcbWfBlHk2T4rlhWCcmnd2Z1imN/E5PRARQMahXq3Yc4M8LNjF75U7iAwG+PagDk8/tSue0ZL9TE5EGTsXAB1v3HmHKgk28tjSHopJSxvRty23ndeXMji38Tk1EGigVAx/lHTrOi59s5q8Lt3LoWDHB01oybkAHLj2jHa2SdZdPEak/KgYR4NCxIqYv3sZrWTls2HOY+IAxonsa4/p34JLebUjWV2mLSB1TMYggzjnW7T7Em8t38q8VO9mx/yiNEgJc0rst4/q159we6STG65oFEal9KgYRqrTUsXTbPt5cvoPZK3exr6CI5o0TGHtGO8b1b8+QzFYE9M2pIlJLVAyiQFFJKR9v+Io3l+/g3TW5FBSW0K55I77Vrz3j+rend7sUffWFiNSIikGUKSgs5r21e5i1
fAcfrs+juNTRrXVTLu/XnvFDMmjdTNcuiMipUzGIYvuOFPL2qt28uXwHizbn07NNM+bcPUI33hGRU1ZRMVAvZRRomZzId4Z2Yub3hvPU9QNYn3uIf63Y6XdaIhJDKi0GZvaCme0xs1VhsUfNbIeZLfceY8PmPWRm2Wa23sxGhcVHe7FsM3swLN7ZzBZ58ZlmpoH3J3HZGe3o1S6FJ9/7kqKSUr/TEZEYUZUjg5eA0eXEn3TO9fcecwDMrDcwHujjLfOMmcWZWRzwNDAG6A1c77UFeNx7rW7APmBSTTYo1gUCxgOjerB1bwGvZeX4nY6IxIhKi4FzbgGQX8XXGwfMcM4dd85tBrKBId4j2zm3yTlXCMwAxlloaMyFwOve8lOBK05tExqeC3q2ZmCnFjw1fwPHikr8TkdEYkBN+gzuMLOV3mmkll6sA7A9rE2OF6songrsd84Vl4mXy8wmm1mWmWXl5eXVIPXoZmY8MOp0dh88xt8WbvU7HRGJAdUtBs8CXYH+wC7gidpK6GScc1Occ0HnXDA9Pb0+VhmxhndN5ZxuaTzz4UYOHy+ufAERkZOoVjFwzuU650qcc6XAc4ROAwHsADLCmnb0YhXF9wItzCy+TFyq4P5RPck/UsiLH2/2OxURiXLVKgZm1i7s6ZXAiZFGs4DxZpZkZp2B7sBiYAnQ3Rs5lEiok3mWC13k8AFwtbf8RODN6uTUEPXPaMElvdswZcEm9hcU+p2OiESxqgwtnQ58BvQ0sxwzmwT8ysy+MLOVwAXAvQDOudXAq8Aa4B3gdu8Iohi4A5gLrAVe9doC/Ai4z8yyCfUhPF+rWxjjfjCyB4cLi/nzgk1+pyIiUUxXIMeAu2d8ztzVu1nwwwv0NRUiclK6AjmG3XtxD4pKHM98sNHvVEQkSqkYxIDMtGSuDXZk2qKt5Owr8DsdEYlCKgYx4s4Lu2NmPDV/g9+piEgUUjGIEe1bNOa7Q0/j9aU5bMw77Hc6IhJlVAxiyPcv6EqjhDienPel36mISJRRMYghaU2TuPnszry1cherdx7wOx0RiSIqBjHm1nO7kNIonife1dGBiFSdikGMad44ge+d15X31+1h6daqftmsiDR0KgYx6KazM0lrmsiv564nWi8qFJH6pWIQg5okxnP7Bd1YuCmfT7L3+p2OiEQBFYMY9Z2hnWjfvBG/nrtORwciUikVgxiVFB/H3Rd3Z0XOAeatyfU7HRGJcCoGMezbAzvSOS2ZJ979kpJSHR2ISMVUDGJYfFyAey/pwfrcQ7y1cqff6YhIBFMxiHGXndGO09s247fzvqSopNTvdEQkQqkYxLhAwLh/ZE+27i3g9aU5fqcjIhFKxaABuKhXawZ0asFT8zdwrKjE73REJAJV5baXL5jZHjNbVc68H5iZM7M077mZ2VNmlm1mK81sYFjbiWa2wXtMDIsP8m6hme0ta7W1cRJiZjwwqie7Dhxj2qJtfqcjIhGoKkcGLwGjywbNLAMYCYT/dxkDdPcek4FnvbatgEeAocAQ4BEza+kt8yxwa9hy31iX1NxZXdM4u1sqz3yQzeHjxX6nIyIRJr6yBs65BWaWWc6sJ4EfAm+GxcYBL7vQVU4LzayFmbUDzgfmOefyAcxsHjDazD4EUpxzC734y8AVwNvV3SCp2P0je3LlM59y5yvL6N6mGfEBCz3iAsTHedOBE9NhsbgACQEjLmAkxAXo26E56c2S/N4cEalFlRaD8pjZOGCHc25FmbM6HYDtYc9zvNjJ4jnlxCta72RCRxx06tSpOqk3aAM6teTGszL5+9IcPtu0l+ISR3E1rj9Ia5rErDvOpn2LxnWQpYj44ZSLgZk1Af6b0CmieuWcmwJMAQgGg7qKqhoevbwPj17e5/+eO+coKQ0VheJSR3FJKUUloVhRSak37z+xvMPHueuVz5k0NYvXbxtOclK1Pk+ISISpzl9yV6AzcOKooCOwzMyGADuAjLC2Hb3YDkKnisLjH3rxjuW0l3pi
ZqHTQXFVX+YP3xnAzS8t4Z6Zy/nzdwcRCKjPXyTanfLQUufcF8651s65TOdcJqFTOwOdc7uBWcAEb1TRMOCAc24XMBcYaWYtvY7jkcBcb95BMxvmjSKawNf7ICQCnd+zNT+5rDfz1uTy+Nx1fqcjIrWg0iMDM5tO6FN9mpnlAI84556voPkcYCyQDRQANwE45/LN7GfAEq/dT090JgPfJzRiqTGhjmN1HkeBiWdlkp13mD9/tIlu6U25JphR+UIiErEsWr/eOBgMuqysLL/TaNCKSkq56cUlLNq8l79NGsrQLql+pyQilTCzpc65YNm4rkCWakuIC/D0DQPJaNWE2/62lK17j/idkohUk4qB1Ejzxgm8MHEwDpg0NYsDR4v8TklEqkHFQGosMy2ZZ28YxJavjnDHK8so1rejikQdFQOpFcO7pvLYlX3594av+Olba/xOR0ROka4Yklpz3eBOZO85zHP/3ky31k2ZMDzT75REpIpUDKRWPTimF5vyjvA//1pDZmoy5/ZI9zslEakCnSaSWhUXMH5//QC6t27K7dOWkb3nkN8piUgVqBhIrWuaFM9fJgZJSghw80tZ5B8p9DslEamEioHUiY4tmzBlQpDdB49x29+WUlisEUYikUzFQOrMwE4t+fXVZ7J4cz4Pv/EF0Xq1u0hDoA5kqVPj+ndg457DPPV+Nt1aN+V753X1OyURKYeKgdS5ey7uwca8I/zynXV0SW/KJb3b+J2SiJSh00RS5wIB4zfX9OOMDs25e8bnrNpxwO+URKQMFQOpF40T4/jLhCApjRIY9/Qn3DI1i/fW5OqrK0QihE4TSb1pndKIN24/i5c/28prWTm8tzaX1s2SuHpQR64NZpCZlux3iiINlu5nIL4oKinlg3V7mLlkOx+s30Opg+FdUrlucAaj+7alUcIp3IdTRKqsovsZqBiI73YfOMbfl+Uwc8l2tuUXkNIonisHdOC6wZ3o3T7F7/REYoqKgUS80lLHwk17mbFkO++s3k1hcSlndGjOdYMzuLx/e1IaJfidokjUq/adzszsBTPbY2arwmI/M7OVZrbczN41s/Ze3MzsKTPL9uYPDFtmoplt8B4Tw+KDzOwLb5mnzMxqvrkSjQIB46xuaTx1/QAW//dFPPqt3hSVlPLjf65iyGPv8YNXV7B4c74uXhOpA5UeGZjZucBh4GXnXF8vluKcO+hN3wX0ds7dZmZjgTuBscBQ4PfOuaFm1grIAoKAA5YCg5xz+8xsMXAXsAiYAzzlnHu7ssR1ZNAwOOdYmXOAmVnbmbV8J4ePF3PR6a35y8Qg+twgcuqqfWTgnFsA5JeJHQx7mkzoHzzAOEJFwznnFgItzKwdMAqY55zLd87tA+YBo715Kc65hS5UlV4Grjj1zZNYZWb0y2jBL648g8UPX8QdF3Rj/ro9zF292+/URGJKta8zMLPHzGw7cAPwEy/cAdge1izHi50snlNOvKJ1TjazLDPLysvLq27qEqWaJMZzz8Xd6dGmKf/79jqOF5f4nZJIzKh2MXDOPeycywCmAXfUXkonXecU51zQORdMT9dNUxqi+LgAD1/am617C3j5061+pyMSM2rjCuRpwLe96R1ARti8jl7sZPGO5cRFKnRej3TO65HOU+9v0L0SRGpJtYqBmXUPezoOWOdNzwImeKOKhgEHnHO7gLnASDNraWYtgZHAXG/eQTMb5o0imgC8Wd2NkYbjx5f2oqCwhN+996XfqYjEhEq/jsLMpgPnA2lmlgM8Aow1s55AKbAVuM1rPofQSKJsoAC4CcA5l29mPwOWeO1+6pw70Sn9feAloDHwtvcQOanubZpx/ZAMpi3axoThp9GtdTO/UxKJarroTKLW3sPHOf/XHxLMbMmLNw3xOx2RqFDtoaUikSq1aRJ3XNiND9bnseBLjS4TqQkVA4lqN56dSUarxjw2ey0lpdF5lCsSCVQMJKolxcfx0JherM89xMwl2ytfQETKpWIgUW9M37YM
zmzJb+et59CxIr/TEYlKKgYS9cyMH1/am68OF/LMhxv9TkckKqkYSEzol9GCqwZ04PmPN7M9v8DvdESijoqBxIwHRvckYPD4O+sqbywiX6NiIDGjXfPGTD63K2+t3MXSrfv8TkckqqgYSEz53rldaN0siZ+9tYZSDTUVqTIVA4kpyUnxPDCqJ8u37+dfK3f6nY5I1FAxkJjz7YEd6dM+hcffXsexIt3zQKQqVAwk5gQCoaGmOw8c4/mPN/udjkhUUDGQmDS8ayoje7fhmQ+y2XPomN/piEQ8FQOJWQ+N7UVhSSm/fVf3PBCpjIqBxKzOaclMGJ7JzKztrNl50O90RCKaioHEtLsu7E7zxgn8fPYaovXeHSL1QcVAYlrzJgncc1F3Pt24l/lr9/idjkjEqrQYmNkLZrbHzFaFxX5tZuvMbKWZvWFmLcLmPWRm2Wa23sxGhcVHe7FsM3swLN7ZzBZ58ZlmlliL2yfCDcNOo0t6Mr+Ys5aiklK/0xGJSFU5MngJGF0mNg/o65w7E/gSeAjAzHoD44E+3jLPmFmcmcUBTwNjgN7A9V5bgMeBJ51z3YB9wKQabZFIGQlxAR4e24tNXx3hbwu3+p2OSESqtBg45xYA+WVi7zrnir2nC4GO3vQ4YIZz7rhzbjOQDQzxHtnOuU3OuUJgBjDOzAy4EHjdW34qcEXNNknkmy48vTXndEvjd+9tYH9Bod/piESc2ugzuBl425vuAITfbirHi1UUTwX2hxWWE/FymdlkM8sys6y8PN3zVqrOzHj40l4cOlbE4++sV2eySBk1KgZm9jBQDEyrnXROzjk3xTkXdM4F09PT62OVEkN6tUvhprM7M33xNm59OUtHCCJhql0MzOxG4DLgBvefj1k7gIywZh29WEXxvUALM4svExepEz++tBePfqs3H32Zx6VPfczn2/RV1yJQzWJgZqOBHwKXO+fCbys1CxhvZklm1hnoDiwGlgDdvZFDiYQ6mWd5ReQD4Gpv+YnAm9XbFJHKmRk3nt2Z1287CzO49s+f8fzHm3XaSBq8qgwtnQ58BvQ0sxwzmwT8EWgGzDOz5Wb2JwDn3GrgVWAN8A5wu3OuxOsTuAOYC6wFXvXaAvwIuM/Msgn1ITxfq1soUo5+GS2YfecIzu/Zmp+9tYbb/raUA0eL/E5LxDcWrZ+IgsGgy8rK8jsNiXLOOZ7/eDO/fHsd7Vo04pnvDOKMjs39TkukzpjZUudcsGxcVyBLg2Zm3DKiCzO/N5ySEse3n/2Ulz/botNG0uCoGIgAg05ryey7RnB2t1R+8uZq7njlcw4d02kjaThUDEQ8LZMTeX7iYB4cczrvrN7Nt/7wMat3HvA7LZF6oWIgEiYQMG47ryszJg/jaFEJVz7zKdMWbdVpI4l5KgYi5Ric2Yo5d41gaOdWPPzGKu6ZuZwjx4srX1AkSqkYiFQgtWkSU28awv0je/CvFTv51h8/Zt1u3SRHYpOKgchJBALGHRd2Z9otwzh0rJgrnv6Ef36ui+Ql9qgYiFTB8K6pzLlrBGd2bMH9r61gZc5+v1MSqVUqBiJVlN4sief+K0h6syTumbmco4UlfqckUmtUDEROQfMmCfzmmn5syjvCL+as9TsdkVqjYiByis7ulsYt53Tmrwu38sE63VdZYoOKgUg13D+qJ6e3bcYDr69k7+HjfqcjUmMqBiLV0Cghjiev68/Bo0U89I8vdFGaRD0VA5Fq6tUuhQdG9eTdNbm8mrW98gVEIpiKgUgNTDqnM8O7pPI//1rD1r1H/E5HpNpUDERqIBAwnri2H/EB456ZyykuKfU7JZFqUTEQqaH2LRrz8yvP4PNt+3nmw41+pyNSLSoGIrXg8n7tGde/Pb+fv4Hl2/f7nY7IKavKPZBfMLM9ZrYqLHaNma02s1IzC5Zp/5CZZZvZejMbFRYf7cWyzezBsHhnM1vkxWeaWWJtbZxIffrpuL60aZbEvTOXU1CobziV6FKVI4OXgNFlYquA
q4AF4UEz6w2MB/p4yzxjZnFmFgc8DYwBegPXe20BHgeedM51A/YBk6q3KSL+at44gSeu7c+WvUf4+WxdnSzRpdJi4JxbAOSXia11zq0vp/k4YIZz7rhzbjOQDQzxHtnOuU3OuUJgBjDOzAy4EHjdW34qcEV1N0bEb8O7pjJ5RBdeWbSN+Wtz/U5HpMpqu8+gAxA+4DrHi1UUTwX2O+eKy8TLZWaTzSzLzLLy8vJqNXGR2nLfyB70apfCj/6+kq90dbJEiajqQHbOTXHOBZ1zwfT0dL/TESlXUnwcv7uuPwePFfPg31fq6mSJCrVdDHYAGWHPO3qxiuJ7gRZmFl8mLhLVerZtxo9Gn857a/cwfbGuTpbIV9vFYBYw3sySzKwz0B1YDCwBunsjhxIJdTLPcqGPTB8AV3vLTwTerOWcRHxx01mZnNMtjZ+9tYbNX+nqZIlsVRlaOh34DOhpZjlmNsnMrjSzHGA4MNvM5gI451YDrwJrgHeA251zJV6fwB3AXGAt8KrXFuBHwH1mlk2oD+H52t1EEX8EAsZvrulHYnyAe2Yup0hXJ0sEs2g9nxkMBl1WVpbfaYhUavbKXdz+yjLuvqg7917Sw+90pIEzs6XOuWDZeFR1IItEo0vPbMdVAzrwxw+yWbZtn9/piJRLxUCkHjw6rg9tUxpx78zlHDmuq5Ml8qgYiNSDlEYJPHldf7blF/DorNUabioRR8VApJ4M6dyK28/vxmtLc3j+481+pyPyNfGVNxGR2nLfJT3YmHeYx+asJaNVE0b1aet3SiKAjgxE6lUgYPz22v6c2bEF98xYzsqc/X6nJAKoGIjUu8aJcfxlQpBWyYlMmprFjv1H/U5JRMVAxA/pzZJ46abBHCsq4eYXl3DoWJHfKUkDp2Ig4pPubZrx7A2D2Jh3mO9PW6YrlMVXKgYiPjqnexqPXdmXf2/4ikc05FR8pNFEIj67bnAntuwt4NkPN5KZ2oTJ53b1OyVpgFQMRCLAAyN7sm1vAb+Ys46Mlk0Yc0Y7v1OSBkaniUQiQCBgPHFtPwZ0asE9M5fzub7DSOqZioFIhGiUEMdzE4K0Tkni1pez2J5f4HdK0oCoGIhEkLSmSbx442AKi0u5+aUlHDiqIadSP1QMRCJMt9bN+NN/DWLzV0f4/rSlGnIq9ULFQCQCndU1jf+96gw+yd7Lj99YpSGnUueqctvLF8xsj5mtCou1MrN5ZrbB+9nSi5uZPWVm2Wa20swGhi0z0Wu/wcwmhsUHmdkX3jJPmZnV9kaKRKNrghnceWE3ZmZt59mPNvqdjsS4qhwZvASMLhN7EJjvnOsOzPeeA4wBunuPycCzECoewCPAUGAI8MiJAuK1uTVsubLrEmmw7rukB5f3a8+v3lnPWyt3+p2OxLBKi4FzbgGQXyY8DpjqTU8FrgiLv+xCFgItzKwdMAqY55zLd87tA+YBo715Kc65hS50HPxy2GuJNHhmxq+uPpPgaS2579UVLN2qIadSN6rbZ9DGObfLm94NtPGmOwDbw9rleLGTxXPKiZfLzCabWZaZZeXl5VUzdZHo0ighjikTgrRr3ohbX85i214NOZXaV+MOZO8Tfb30bjnnpjjngs65YHp6en2sUiQitEpO5MUbB1NS6rjq2U/5w/wN7D183O+0JIZUtxjkeqd48H7u8eI7gIywdh292MniHcuJi0gZXdKbMu2WofRq14wn5n3J8F++z/2vrWDVjgN+pyYxoLrFYBZwYkTQRODNsPgEb1TRMOCAdzppLjDSzFp6HccjgbnevINmNswbRTQh7LVEpIy+HZrz10lDmXfvuVwb7Mjslbu47A8fc+2fPmPOF7so1jUJUk1W2fhlM5sOnA+kAbmERgX9E3gV6ARsBa51zuV7/9D/SGhEUAFwk3Muy3udm4H/9l72Mefci148SGjEUmPgbeBOV4VB1cFg0GVlZZ3CporEngMFRbyatZ2pn20hZ99R2jdvxH8Nz2T84AxaJif6nZ5E
IDNb6pwLfiMerRezqBiI/EdJqWP+2lxe/GQLn23aS1J8gCsHdODGszM5vW2K3+lJBFExEGkg1u0+yNRPt/CPZTs4XlzK8C6p3Hh2Jhf3akNcQNd0NnQqBiINzL4jhcxYsp2/fraFnQeO0bFlYyYOz+Syfu1o3jiBxglx6IL/hkfFQKSBKi4pZd6a0CmkxVv+c/2oGTRJiKNJUjzJiXE0SYynSeLXnycnhcUT40hOiic5KZ6URvGkNE4gpVECKY3jad44gaT4OB+3UqqqomKgO52JxLj4uABjzmjHmDPasXrnAZZu3UdBYQkFx4s5UlhCQWExBYUlHDkemj5wtIhd+4+GYoXFFBwvobAKo5SS4gNegQgvFGWfx5OanMhpqcmcltqEJon6FxQp9JsQaUD6tG9On/bNT3m5opLSUAEpLObwsWIOHivi4NETP4s4eKzY+/mf+P6CQrblF/xfvKjkm2chWjdLItMrDJlp3s/UZDqlNiGlUUJtbLJUkYqBiFQqIS5A88YBmjdOgFOvJTjnOFZUysFjRew5eJyt+UfYureArXuPsGVvAQs25PHa0pyvLRM6gmjiFYtQoTgttQldWzdVoagDKgYiUufMjMaJcTROjKNNSiPO6PjNilJQWMy2/AK2fPWfIrF17xEWbc7njeU7ONG9GRcwgqe15JLebbi4Vxsy05LreWtikzqQRSTiHSsqIWdfqFB8vn0f89fuYd3uQwB0a92Ui3q15pJebRjQqaWGz1ZCo4lEJKZszy/gvbW5zF+7h4Wb9lJc6miVnMiFp7fm4l5tGNE9jeQknfwoS8VARGLWwWNFfLQ+j/fW5vLBuj0cPFZMYnyAs7qmcnGvNlzUqzXtmjf2O82IoGIgIg1CUUkpWVv28d7aXN5bm8tW7/4PfTukcHGvUD9Dn/YpDfaCOxUDEWlwnHNszDvMvDV7eG9tLsu27cM56NCiMZf0bsMlvdswpHMrEuJqfGuXqKFiICIN3leHj/P+2j28uyaXf2/I43hxKSmN4rnw9NZc0rst5/VMp2mM9zOoGIiIhCkoLObfG75i3ppc5q/NZV9BEYlxAc7qlsrI3m25uFdrWqc08jvNWqdiICJSgeKSUpZu3ce7a3KZtyaXbfmhfoYBnVpwSe82jOzdlm6tm/qcZe1QMRARqQLnHOtzDzFvdS7vrsnlC++2ol3SkrmkdxtG921L/4wWUdsBrWIgIlINO/cf5b21oSOGzzaGrmfo0aYp4wd34qqBHWjRJLruKKdiICJSQweOFvH2F7uYvngbK3IOkBgfYGzftowf0omhnVtFxdFCnRQDM7sbuBUw4Dnn3O/MrBUwE8gEthC6P/I+7/7IvwfGEro/8o3OuWXe60wEfuy97M+dc1MrW7eKgYj4afXOA8xYvJ1/fr6DQ8eL6ZKezPjBGXx7YEdSmyb5nV6Far0YmFlfYAYwBCgE3gFuAyYD+c65X5rZg0BL59yPzGwscCehYjAU+L1zbqhXPLKAIOCApcAg59y+k61fxUBEIkFBYTGzV+5ixpLtLN26j4Q4Y2SftnxnSCeGd0klEGHflVQXN7fpBSxyzhV4K/gIuAoYB5zvtZkKfAj8yIu/7ELVZ6GZtTCzdl7bec65fO915gGjgek1yE1EpF40SYznmmAG1wQz+DL3ENMXb+Mfy3Ywe+UuOrVqwvghGVw9qCOtm0X2MNWaXHa3ChhhZqlm1oTQJ/4MoI1zbpfXZjfQxpvuAGwPWz7Hi1UU/wYzm2xmWWaWlZeXV4PURURqX482zXjkW31Y9N8X8fvx/WnXvBG/emc9Z/3v+3zvr1l8uH4PJaWR2U9b7SMD59xaM3sceBc4AiwHSsq0cWZWa1vunJsCTIHQaaLael0RkdrUKCGOcf07MK5/BzblHWbmku28tjSHuatz6ZyWzO+u60+/jBZ+p/k1NfpCDufc8865Qc65c4F9wJdArnf6B+/nHq/5DkJHDid09GIVxUVEol6X9KY8NLYXCx+6
iD9+ZwCFxaVc/adPeemTzUTSaM4aFQMza+397ESov+AVYBYw0WsyEXjTm54FTLCQYcAB73TSXGCkmbU0s5bASC8mIhIzEuMDXHZme2bfdQ7ndk/n0X+t4fZXlnHwWJHfqQE1v+3l380sFSgCbnfO7TezXwKvmtkkYCtwrdd2DqF+hWxCQ0tvAnDO5ZvZz4AlXrufnuhMFhGJNS2aJPLchCDP/XsTv5q7ntU7P+bp7wykb4dq3Fy6FumiMxERn2RtyeeOVz4nv6CQn1zWmxuGdqrzC9cqGlracL7EW0QkwgQzWzH7rnMY3iWVH/9zFXfNWM7h48W+5KJiICLio9SmSbx442AeGNWT2St3cvkfPmbtroP1noeKgYiIzwIB4/YLuvHKrcM4fLyYK57+hJlLttXraCMVAxGRCDGsSyqz7xrB4MxW/OjvX/CDV1dQUFg/p41UDEREIkh6sySm3jyEey/uwRvLd3D5Hz/hy9xDdb5eFQMRkQgTFzDuvrg7f5s0lP0FhYz74ye8vjSnTtepYiAiEqHO7pbGnLtG0C+jOfe/toIfvr6Co4UllS9YDSoGIiIRrHVKI6bdMoy7LuzGa0tzuOLpT8g9eKzW11PTK5BFRKSOxQWM+0b2JJjZimmLttIqufZvtaliICISJc7tkc65PdLr5LV1mkhERFQMRERExUBERFAxEBERVAxERAQVAxERQcVARERQMRAREaL4tpdmlkfoHsvVkQZ8VYvp1DblVzPKr2aUX81Een6nOee+ceVa1BaDmjCzrPLuARoplF/NKL+aUX41E+n5VUSniURERMVAREQabjGY4ncClVB+NaP8akb51Uyk51euBtlnICIiX9dQjwxERCSMioGIiMR2MTCz0Wa23syyzezBcuYnmdlMb/4iM8usx9wyzOwDM1tjZqvN7O5y2pxvZgfMbLn3+El95eetf4uZfeGtO6uc+WZmT3n7b6WZDazH3HqG7ZflZnbQzO4p06Ze95+ZvWBme8xsVVislZnNM7MN3s+WFSw70Wuzwcwm1mN+vzazdd7v7w0za1HBsid9L9Rhfo+a2Y6w3+HYCpY96d96HeY3Myy3LWa2vIJl63z/1ZhzLiYfQBywEegCJAIrgN5l2nwf+JM3PR6YWY/5tQMGetPNgC/Lye984C0f9+EWIO0k88cCbwMGDAMW+fi73k3oYhrf9h9wLjAQWBUW+xXwoDf9IPB4Ocu1AjZ5P1t60y3rKb+RQLw3/Xh5+VXlvVCH+T0K3F+F3/9J/9brKr8y858AfuLX/qvpI5aPDIYA2c65Tc65QmAGMK5Mm3HAVG/6deAiM7P6SM45t8s5t8ybPgSsBTrUx7pr0TjgZReyEGhhZu18yOMiYKNzrrpXpNcK59wCIL9MOPw9NhW4opxFRwHznHP5zrl9wDxgdH3k55x71zlX7D1dCHSs7fVWVQX7ryqq8rdeYyfLz/u/cS0wvbbXW19iuRh0ALaHPc/hm/9s/6+N9wdxAEitl+zCeKenBgCLypk93MxWmNnbZtanfjPDAe+a2VIzm1zO/Krs4/ownor/CP3cfwBtnHO7vOndQJty2kTKfryZ0JFeeSp7L9SlO7zTWC9UcJotEvbfCCDXObehgvl+7r8qieViEBXMrCnwd+Ae59zBMrOXETr10Q/4A/DPek7vHOfcQGAMcLuZnVvP66+UmSUClwOvlTPb7/33NS50viAix3Kb2cNAMTCtgiZ+vReeBboC/YFdhE7FRKLrOflRQcT/LcVyMdgBZIQ97+jFym1jZvFAc2BvvWQXWmcCoUIwzTn3j7LznXMHnXOHvek5QIKZpdVXfs65Hd7PPcAbhA7Hw1VlH9e1McAy51xu2Rl+7z9P7olTZ97PPeW08XU/mtmNwGXADV7B+oYqvBfqhHMu1zlX4pwrBZ6rYL1+77944CpgZkVt/Np/pyKWi8ESoLuZdfY+PY4HZpVpMws4MXLjauD9iv4Yapt3jvF5YK1z7rcVtGl7
og/DzIYQ+n3VS7Eys2Qza3ZimlBH46oyzWYBE7xRRcOAA2GnROpLhZ/I/Nx/YcLfYxOBN8tpMxcYaWYtvdMgI71YnTOz0cAPgcudcwUVtKnKe6Gu8gvvg7qygvVW5W+9Ll0MrHPO5ZQ308/9d0r87sGuyweh0S5fEhpp8LAX+ymhNz5AI0KnF7KBxUCXesztHEKnDFYCy73HWOA24DavzR3AakKjIxYCZ9Vjfl289a7wcjix/8LzM+Bpb/9+AQTr+febTOife/OwmG/7j1BR2gUUETpvPYlQH9R8YAPwHtDKaxsE/hK27M3e+zAbuKke88smdL79xHvwxOi69sCck70X6im/v3rvrZWE/sG3K5uf9/wbf+v1kZ8Xf+nEey6sbb3vv5o+9HUUIiIS06eJRESkilQMRERExUBERFQMREQEFQMREUHFQEREUDEQERHg/wMY38td1sX9QQAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[:20])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
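-    "The 20 most-clicked news articles each have more than 2,500 clicks. Idea: these articles can be defined as hot news. This is a simple approach; later we also grade article popularity by click count and time."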
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 40,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(item_click_count[3500:])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Many news articles were clicked only once or twice. Idea: these articles can be defined as cold (unpopular) news."
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### News co-occurrence frequency: the number of times two articles are clicked consecutively"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
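-       "<div>\n",
-       "<table border=\"1\" class=\"dataframe\">\n",
-       "  <thead>\n",
-       "    <tr style=\"text-align: right;\">\n",
-       "      <th></th>\n",
-       "      <th>count</th>\n",
-       "    </tr>\n",
-       "  </thead>\n",
-       "  <tbody>\n",
-       "    <tr>\n",
-       "      <th>count</th>\n",
-       "      <td>433597.000000</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>mean</th>\n",
-       "      <td>3.184139</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>std</th>\n",
-       "      <td>18.851753</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>min</th>\n",
-       "      <td>1.000000</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>25%</th>\n",
-       "      <td>1.000000</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>50%</th>\n",
-       "      <td>1.000000</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>75%</th>\n",
-       "      <td>2.000000</td>\n",
-       "    </tr>\n",
-       "    <tr>\n",
-       "      <th>max</th>\n",
-       "      <td>2202.000000</td>\n",
-       "    </tr>\n",
-       "  </tbody>\n",
-       "</table>\n",
-       "</div>"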
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 34,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " count\n",
- "count 433597.000000\n",
- "mean 3.184139\n",
- "std 18.851753\n",
- "min 1.000000\n",
- "25% 1.000000\n",
- "50% 1.000000\n",
- "75% 2.000000\n",
- "max 2202.000000"
- ]
- },
- "execution_count": 41,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "tmp = user_click_merge.sort_values('click_timestamp')\n",
- "tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x:x.shift(-1))\n",
- "union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].agg({'count'}).reset_index().sort_values('count', ascending=False)\n",
- "union_item[['count']].describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
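-    "The statistics show that the mean co-occurrence count is 3.18 and the maximum is 2202.\n",
-    "\n",
-    "This suggests that the articles a user reads are fairly strongly related."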
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- ""
- ]
- },
- "execution_count": 42,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Plot it for a more intuitive view\n",
- "x = union_item['click_article_id']\n",
- "y = union_item['count']\n",
- "plt.scatter(x, y)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 43,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 43,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD4CAYAAADvsV2wAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAATdElEQVR4nO3df6xkZX3H8fe37Aq2EPmxN7pd9nKhmhgxuOB1hUANISHlV+CPYrqkRUTNNoopVlsrmiCamIhNlSpG3ApF1Cr4syuFWFqwahuW7OKy/BK9KgYQ3AVkkarU1W//mLMwd5hzZ+7MmTt3znm/ksmeOeeZOd89s/dzn32ec85EZiJJqr/fG3cBkqSlYeBLUkMY+JLUEAa+JDWEgS9JDbFiXDtetWpVzszMjGv3kjSRtm3b9mhmTg3y2rEF/szMDFu3bh3X7iVpIkXETwZ9rUM6ktQQBr4kNYSBL0kNYeBLUkMY+JLUEH0HfkTsExHfjYjru2zbNyKujYi5iNgSETOVVilJGtpievgXAveWbHsj8PPMfDHwEeDSYQuTJFWrr/PwI+JQ4HTgA8DbuzQ5C7ikWP4ScHlERI7g3sv3PfIL/m3HT0u3n/CSKdYffnDVu5WkidfvhVeXAe8EDijZvgZ4ACAz90TEbuAQ4NH2RhGxEdgIMD09PUC5MLfzKT52y1zXbZlw648f57q/PG6g95akOusZ+BFxBrAzM7dFxInD7CwzNwGbAGZnZwfq/Z9+1GpOP+r0rtv+/FO38vRvfjd4gZJUY/2M4R8PnBkR9wNfAE6KiM92tHkIWAsQESuAFwCPVVhn3/z+LknqrmfgZ+ZFmXloZs4AG4CbM/MvOpptBs4rls8u2pi9krSMDHzztIh4P7A1MzcDVwKfiYg54HFavxiWXBD4e0aSultU4GfmN4FvFssXt63/NfDaKguTJFWrVlfaRoy7AklavmoV+OCkrSSVqV3gS5K6q13gO2crSd3VLvAlSd3VKvAjwjF8SSpRq8CXJJWrVeB7VqYklatV4APO2kpSiVoFvhdeSVK5WgU+eOGVJJWpXeBLkrqrVeAHDuFLUplaBb4kqVytAj+ctZWkUrUKfIB02laSuqpV4Nu/l6RytQp8cNJWksrULvAlSd3VKvAj7OFLUplaBb4kqVzNAt9pW0kqU7PA9146klSmVoHvdVeSVK5n4EfEfhFxW0TcERF3R8T7urR5fUTsiojtxeNNoym3t3TWVpK6WtFHm6eBkzLzqYhYCXwnIm7MzFs72l2bmW+tvkRJUhV6Bn62usxPFU9XFo9l2Y12REeSyvU1hh8R+0TEdmAncFNmbunS7E8jYkdEfCki1pa8z8aI2BoRW3ft2jV41ZKkResr8DPzt5m5DjgUWB8RL+9o8nVgJjOPAm4CPl3yPpsyczYzZ6empoYouzsnbSWp3KLO0snMJ4BbgFM61j+WmU8XTz8FvLKS6gbgnK0kddfPWTpTEXFgsfx84GTgex1tVrc9PRO4t8Ia+xaO4ktSqX7O0lkNfDoi9qH1C+K6zLw+It4PbM3MzcBfRcSZwB7gceD1oyq4F++HL0nd9XOWzg7g6C7rL25bvgi4qNrSJElVqt2Vto7hS1J3tQp8SVK5WgW+p2VKUrlaBT4s00uAJWkZqFXge1qmJJWrVeCDd8uUpDK1C3xJUnf1CvxwDF+SytQr8CVJpWoV+E7ZSlK5WgU+4JiOJJWoVeCHV15JUqlaBT7YwZekMrULfElSd7UK/MALrySpTK0CX5JUrlaB75ytJJWrVeCDk7aSVKZWgW8HX5LK1Srwwa84lKQytQt8SVJ3tQr8iCAdxZekrmoV+JKkcrUKfCdtJalcz8CPiP0i4raIuCMi7o6I93Vps29EXBsRcxGxJSJmRlJtH5y0laTu+unhPw2clJmvANYBp0TEsR1t3gj8PDNfDHwEuLTSKvtlF1+SSq3o1SBbN6d5qni6snh09qPPAi4plr8E
XB4RkWO4sc3j//t//O0X7xj49RvWr+WVhx1cYUWStDz0DHyAiNgH2Aa8GPh4Zm7paLIGeAAgM/dExG7gEODRjvfZCGwEmJ6eHq7yLtbPHMytP3yM/557tHfjLh558tckGPiSaqmvwM/M3wLrIuJA4KsR8fLMvGuxO8vMTcAmgNnZ2cp7/xvWT7Nh/eC/SI7/4M0VViNJy8uiztLJzCeAW4BTOjY9BKwFiIgVwAuAxyqob8k56Suprvo5S2eq6NkTEc8HTga+19FsM3BesXw2cPM4xu8lSeX6GdJZDXy6GMf/PeC6zLw+It4PbM3MzcCVwGciYg54HNgwsopHzCt1JdVVP2fp7ACO7rL+4rblXwOvrbY0SVKVanWl7bD8AhVJdWbgd3JER1JNGfht7OFLqjMDv4MdfEl1ZeBLUkMY+G2CwMsHJNWVgS9JDWHgt3HSVlKdGfgdHNCRVFcGfhs7+JLqzMDv4JytpLoy8CWpIQz8NhHhGL6k2jLwJakhDPw2TtpKqjMDv4NX2kqqKwO/nV18STVm4Hewfy+prgx8SWoIA79NgF18SbVl4EtSQxj4bcLbZUqqMQO/QzqmI6mmDPw29u8l1VnPwI+ItRFxS0TcExF3R8SFXdqcGBG7I2J78bh4NOWOntddSaqrFX202QO8IzNvj4gDgG0RcVNm3tPR7tuZeUb1JUqSqtCzh5+ZD2fm7cXyL4B7gTWjLmwcIuzhS6qvRY3hR8QMcDSwpcvm4yLijoi4MSKOLHn9xojYGhFbd+3atfhqJUkD6zvwI2J/4MvA2zLzyY7NtwOHZeYrgI8BX+v2Hpm5KTNnM3N2ampqwJJHJ5y2lVRjfQV+RKykFfafy8yvdG7PzCcz86li+QZgZUSsqrTSJeJpmZLqqp+zdAK4Erg3Mz9c0uZFRTsiYn3xvo9VWehS8LorSXXWz1k6xwPnAndGxPZi3buBaYDMvAI4G3hzROwBfgVsyAm9sfxkVi1JvfUM/Mz8Dj2uScrMy4HLqypKklQ9r7TtYAdfUl0Z+JLUEAZ+G++WKanODPwOTtpKqisDv439e0l1ZuA/h118SfVk4EtSQxj4bbxbpqQ6M/AlqSEM/DYRjuBLqi8DX5IawsBv4/3wJdWZgd9hQm/yKUk9GfiS1BAGfhsnbSXVmYEvSQ1h4LcJvPBKUn0Z+JLUEAZ+O++HL6nGDPwOjuhIqisDX5IawsBv05q0tY8vqZ4MfElqCAO/jXO2kuqsZ+BHxNqIuCUi7omIuyPiwi5tIiI+GhFzEbEjIo4ZTbmSpEGt6KPNHuAdmXl7RBwAbIuImzLznrY2pwIvKR6vBj5R/DlR7OBLqrOegZ+ZDwMPF8u/iIh7gTVAe+CfBVyTrRnPWyPiwIhYXbx2otz50G7OvXLLuMt4jvOPn+Gkl75w3GVImmD99PCfEREzwNFAZyKuAR5oe/5gsW5e4EfERmAjwPT09CJLHb0zjvpDvr7jpzz19J5xlzLP3Q89ydQB+xr4kobSd+BHxP7Al4G3ZeaTg+wsMzcBmwBmZ2eX3fmPbzjhcN5wwuHjLuM5Trj05nGXIKkG+jpLJyJW0gr7z2XmV7o0eQhY2/b80GKdqrLsfj1KmjT9nKUTwJXAvZn54ZJmm4HXFWfrHAvsnsTxe0mqs36GdI4HzgXujIjtxbp3A9MAmXkFcANwGjAH/BI4v/JKG8wvZpFUhX7O0vkOPc5YLM7OuaCqoiRJ1fNK2wkQXiEgqQIG/oTwpm6ShmXgTwDv8SOpCgb+hLB/L2lYBr4kNYSBPwFaX8wy7iokTToDX5IawsCfABHhGL6koRn4ktQQBv4E8KxMSVUw8CeEF15JGpaBL0kNYeBPAu+WKakCBr4kNYSBPwEC7OJLGpqBL0kNYeBPgPB2mZIqYOBPiHRMR9KQDHxJaggDfwJ4t0xJVTDwJakhDPwJEGEPX9LwDHxJaggDfwKE98uUVIGegR8RV0XEzoi4q2T7iRGxOyK2F4+Lqy9T
npYpaVgr+mhzNXA5cM0Cbb6dmWdUUpEkaSR69vAz81vA40tQi0o4aSupClWN4R8XEXdExI0RcWRZo4jYGBFbI2Lrrl27Ktq1JKkfVQT+7cBhmfkK4GPA18oaZuamzJzNzNmpqakKdt0cdvAlDWvowM/MJzPzqWL5BmBlRKwaujJJUqWGDvyIeFEUt3OMiPXFez427PvqWd4tU1IVep6lExGfB04EVkXEg8B7gZUAmXkFcDbw5ojYA/wK2JB+43blPKKShtUz8DPznB7bL6d12qYkaRnzStsJ0BrQsYsvaTgGviQ1hIE/AbzwSlIVDHxJaggDfwJ4VqakKhj4E8IRHUnDMvAngPfDl1QFA39CeC2bpGEZ+JLUEAb+BIhwDF/S8Ax8SWoIA38COGUrqQoG/oRwzlbSsAz8SeCVV5IqYOBPCDv4koZl4EtSQxj4EyDwwitJwzPwJakhDPwJ4JytpCoY+JLUEAb+BLCDL6kKBv6EcM5W0rAMfElqCAN/AkQE6aVXkobUM/Aj4qqI2BkRd5Vsj4j4aETMRcSOiDim+jIlScPqp4d/NXDKAttPBV5SPDYCnxi+LLVz0lZSFVb0apCZ34qImQWanAVck61LQW+NiAMjYnVmPlxVkYLbf/IEJ3/4v8ZdhqQK/Nmr1vKmPz5iyffbM/D7sAZ4oO35g8W65wR+RGyk9b8ApqenK9h1M5x73GF84+5Hxl2GpIqs2n/fsey3isDvW2ZuAjYBzM7OOgvZp7PWreGsdWvGXYakCVfFWToPAWvbnh9arJMkLSNVBP5m4HXF2TrHArsdv5ek5afnkE5EfB44EVgVEQ8C7wVWAmTmFcANwGnAHPBL4PxRFStJGlw/Z+mc02N7AhdUVpEkaSS80laSGsLAl6SGMPAlqSEMfElqiBjXl2NHxC7gJwO+fBXwaIXlVMnaFm+51gXWNojlWhfUo7bDMnNqkB2MLfCHERFbM3N23HV0Y22Lt1zrAmsbxHKtC6zNIR1JaggDX5IaYlIDf9O4C1iAtS3ecq0LrG0Qy7UuaHhtEzmGL0lavEnt4UuSFsnAl6SmyMyJetD6ft37aN2d810j3M/9wJ3AdmBrse5g4CbgB8WfBxXrA/hoUdMO4Ji29zmvaP8D4Ly29a8s3n+ueG0sUMtVwE7grrZ1I6+lbB991HYJre9E2F48TmvbdlGxn/uAP+n1uQKHA1uK9dcCzyvW71s8nyu2z3TUtRa4BbgHuBu4cLkctwVqG+txA/YDbgPuKOp63xDvVUm9fdR2NfDjtmO2bkw/B/sA3wWuXy7HrGuWjCowR/EoDuoPgSOA5xUf/stGtK/7gVUd6z6094AD7wIuLZZPA24s/pEdC2xp+4fyo+LPg4rlvQFzW9E2iteeukAtrwGOYX6ojryWsn30UdslwN90afuy4jPbt/jH+sPiMy39XIHrgA3F8hXAm4vltwBXFMsbgGs79rWa4occOAD4frH/sR+3BWob63Er/h77F8sraYXJsYt9ryrr7aO2q4Gzuxyzpf45eDvwLzwb+GM/Zl2zZBRhOaoHcBzwjbbnFwEXjWhf9/PcwL8PWN32Q3tfsfxJ4JzOdsA5wCfb1n+yWLca+F7b+nntSuqZYX6ojryWsn30UdsldA+ueZ8X8I3iM+36uRY/eI8CKzo//72vLZZXFO0W+l/SvwInL6fj1qW2ZXPcgN8Hbgdevdj3qrLekuPVXtvVdA/8Jfs8aX3L338CJwHXD3L8R33M9j4mbQy/7AvTRyGBf4+IbcWXrwO8MJ/9Nq9HgBf2qGuh9Q92Wb8YS1FL2T768daI2BERV0XEQQPWdgjwRGbu6VLbM68ptu8u2j9HRMwAR9PqFS6r49ZRG4z5uEXEPhGxndYw3U20epeLfa8q620/VvNqy8y9x+wDxTH7SETs/Xbwpfw8LwPeCfyueD7I8R/JMes0aYG/lE7IzGOAU4ELIuI17Ruz9Ws1x1JZh6WoZZH7+ATwR8A64GHgH0ZU
Vk8RsT/wZeBtmflk+7ZxH7cutY39uGXmbzNzHa1e63rgpUtdQ5nO2iLi5bR6uy8FXkVrmObvRlzDvM8zIs4AdmbmtlHutyqTFvhL9oXpmflQ8edO4Ku0/vH/LCJWAxR/7uxR10LrD+2yfjGWopayfSwoM39W/HD+DvgnWsdukNoeAw6MiBUd6+e9V7H9BUX7Z0TESlqB+rnM/EqPv9OSHrdutS2X41bU8gStieXjBnivKut9jrbaTsnMh7PlaeCfGfyYDfp5Hg+cGRH3A1+gNazzjwv8fcZyzJ7Ra8xnOT1ojXn9iNakxt4JjCNHsJ8/AA5oW/4fWjPlf8/8yZsPFcunM3+C6LZi/cG0ziA4qHj8GDi42NY5QXRaj5pmmD9OPvJayvbRR22r25b/GvhCsXwk8yemfkRrUqr0cwW+yPyJqbcUyxcwf/Lruo6aArgGuKxj/diP2wK1jfW4AVPAgcXy84FvA2cs9r2qrLeP2la3HdPLgA+O8efgRJ6dtB37MetaY9VhOeoHrdn379MaW3zPiPZxRHFg954C9p5i/SG0Jmd+APxH2z+UAD5e1HQnMNv2Xm+gddrUHHB+2/pZ4K7iNZez8ITj52n9F/83tMbq3rgUtZTto4/aPlPsewewmflB9p5iP/fRdmZS2edafBa3FTV/Edi3WL9f8Xyu2H5ER10n0Pqv9w7aTnNcDsdtgdrGetyAo2idWrij+HtdPMR7VVJvH7XdXByzu4DP8uyZPEv6c1C0O5FnA3/sx6zbw1srSFJDTNoYviRpQAa+JDWEgS9JDWHgS1JDGPiS1BAGviQ1hIEvSQ3x/4tppPoWqYdUAAAAAElFTkSuQmCC\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(union_item['count'].values[40000:])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "Roughly 75,000 pairs co-occur at least once"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### News article information"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 44,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Number of clicks for each news category\n",
- "plt.plot(user_click_merge['category_id'].value_counts().values)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 45,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
-    "# Rarely seen news categories; some categories appear only a few times\n",
- "plt.plot(user_click_merge['category_id'].value_counts().values[150:])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "count 1.630633e+06\n",
- "mean 2.043012e+02\n",
- "std 6.382198e+01\n",
- "min 0.000000e+00\n",
- "25% 1.720000e+02\n",
- "50% 1.970000e+02\n",
- "75% 2.290000e+02\n",
- "max 6.690000e+03\n",
- "Name: words_count, dtype: float64"
- ]
- },
- "execution_count": 46,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
-    "# Descriptive statistics of article word counts\n",
- "user_click_merge['words_count'].describe()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 47,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 47,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(user_click_merge['words_count'].values)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### User preference across clicked news categories\n",
-    "\n",
-    "This feature can be used to measure how broad a user's interests are."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 48,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 48,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAUlUlEQVR4nO3dfZBc1Xnn8e8zM3pBaCwkNBJCAiQbsKwEy8CYwoEihTG2wXGwY5dDditWHGrZsp3EjpNdw9q1dtXGu3YqNvFWsomJIaESyoGAMSQFwRhjezeJJY+MAAsEEuJFEnoZAXpBGAlJZ//oK2UkzfRtzfR097nz/VRNze3Tt/s+Z27rp9unT98bKSUkSfnrancBkqTmMNAlqSIMdEmqCANdkirCQJekiuhp5cZmz56dFi5c2MpNSlL2Vq5cuT2l1Fe2XksDfeHChQwMDLRyk5KUvYh4rpH1HHKRpIow0CWpIgx0SaoIA12SKsJAl6SKMNAlqSIMdEmqiCwC/a6HN/J3P25oGqYkTVhZBPo9q17g9oEN7S5DkjpaFoEuSSpnoEtSRWQT6F4pT5LqyyLQI6LdJUhSx8si0CVJ5Qx0SaqIbAI94SC6JNWTRaA7gi5J5bIIdElSOQNdkioim0B3Hrok1ZdFoDsNXZLKZRHokqRy2QS6Qy6SVF8mge6YiySVySTQJUllDHRJqohsAt0hdEmqL4tAd9qiJJXLItAlSeUMdEmqiGwCPTkRXZLqyiLQHUKXpHJZBLokqZyBLkkVYaBLUkVkEejOQ5ekclkEuiSpXEOBHhG/HxGrI+JnEfGtiJgaEYsiYnlErIuI2yJi8ngXK0kaWWmgR8R84PeA/pTSLwLdwNXAV4AbUkpnAi8D14xnoU5Dl6T6Gh1y6QFOiIgeYBqwGXgncEdx/y3AB5peXSGciS5JpUoDPaW0CfgT4HlqQb4TWAnsSCntL1bbCMwf7vERcW1EDETEwODgYHOqliQdo5Ehl5nAVcAi4FTgROC9jW4gpXRjSqk/pdTf19c36kIlSfU1MuTyLuCZlNJgSul14NvARcBJxRAMwAJg0zjVCEDyjOiSVFcjgf48cGFETIuIAC4DHgceAj5crLMMuHt8SnQeuiQ1opEx9OXUPvz8KfBY8Zgbgc8Cn4mIdcDJwE3jWKckqURP+SqQUvoC8IWjmtcDFzS9IknSqGTzTVHnoUtSfVkEumPoklQui0CXJJUz0CWpIrIJdIfQJam+LALdc7lIUrksAl2SVC6bQE/OW5SkuvIIdEdcJKlUHoEuSSploEtSRWQT6I6gS1J9WQS6Q+iSVC6LQJcklTPQJaki8gl0B9Elqa4sAj08f64klcoi0CVJ5Qx0SaqIbALdIXRJqi+LQHcEXZLKZRHokqRyBrokVUQ2ge750CWpviwC3WnoklQui0CXJJUz0CWpIrIJdEfQJam+LALdIXRJKpdFoEuSyhnoklQR2QS609Alqb4sAt3zoUtSuYYCPSJOiog7ImJNRDwREe+IiFkR8UBErC1+zxzvYiVJI2v0CP3rwD+nlBYDS4EngOuAB1NKZwEPFrclSW1SGugRMQO4BLgJIKW0L6W0A7gKuKVY7RbgA+NTYk1yJrok1dXIEfoiYBD464h4OCK+GREnAnNTSpuLdbYAc4d7cERcGxEDETEwODg4qiIdQZekco0Eeg9wHvAXKaVzgT0cNbySaqdCHPYQOqV0Y0qpP6XU39fXN9Z6JUkjaCTQNwIbU0rLi9t3UAv4rRExD6D4vW18Sqxx2qIk1Vca6CmlLcCGiHhz0XQZ8DhwD7CsaFsG3D0uFYJjLpLUgJ4G1/td4NaImAysBz5G7T+D2yPiGuA54CPjU6IkqRENBXpKaRXQP8xdlzW1GknSqGXxTVFwDF2SymQR6OEguiSVyiLQJUnlDHRJqggDXZIqIotA9+y5klQui0CXJJUz0CWpIrIJ9OREdEmqK4tAdwhdksplEeiSpHIG
uiRVRDaB7gi6JNWXRaA7D12SymUR6JKkcga6JFVENoHuNHRJqi+LQPd86JJULotAlySVM9AlqSKyCfTkTHRJqiuLQHceuiSVyyLQJUnlsgl0py1KUn1ZBLpDLpJULotAlySVM9AlqSKyCXSH0CWpvkwC3UF0SSqTSaBLkspkE+hOW5Sk+rII9NcPHGT7K3vbXYYkdbQsAv3nrx+gd2pPu8uQpI7WcKBHRHdEPBwR/1TcXhQRyyNiXUTcFhGTx6vIOb1TnOYiSSWO5wj9U8ATQ25/BbghpXQm8DJwTTMLG6o7ggMOoktSXQ0FekQsAN4HfLO4HcA7gTuKVW4BPjAO9QHQ3RUcOGigS1I9jR6h/ynwX4GDxe2TgR0ppf3F7Y3A/OEeGBHXRsRARAwMDg6Orsiu4KBH6JJUV2mgR8SvANtSSitHs4GU0o0ppf6UUn9fX99onqI25OIRuiTV1cjUkYuAX42IK4GpwBuArwMnRURPcZS+ANg0XkV2BZjnklRf6RF6Sun6lNKClNJC4Grg+yml/wg8BHy4WG0ZcPe4FdlV++r/QVNdkkY0lnnonwU+ExHrqI2p39Scko7VXZwQ3ZkukjSy4/q2TkrpB8APiuX1wAXNL+lYh47QDxxMTOpuxRYlKT9ZfFN0589fB2Dv/oMla0rSxJVFoJ86YyqAM10kqY4sAr2nu1bm/oMeoUvSSPII9GIMff8Bj9AlaSR5BPqhI3QDXZJGlEegHzpCd8hFkkaURaDvO1AL8hf37GtzJZLUubII9FNnnAD4TVFJqieLQJ86qVbmoSN1SdKxsgj0ScWHovv8YpEkjSiLQO/prn0o+tyLr7a5EknqXFkE+uzpUwCYMimLciWpLbJIyCk9DrlIUpksAn2ygS5JpfII9OJD0Uc27mhvIZLUwbII9ENf/T8020WSdKxsEnLJvDfw1NZX2l2GJHWsbAJ9z779nDjZyxVJ0kiyCfSz5/by6Kad7S5DkjpWNoH+2usHiHYXIUkdLJtAP/f0mezdf9ATdEnSCLIJ9JRqQf7Czp+3uRJJ6kzZBPpb5r0BgMHde9tciSR1pmwCfcYJkwBYs2V3myuRpM6UTaAvPqUXgE0vO+QiScPJJtB7p9aO0Fc8+1KbK5GkzpRNoE/u6WLxKb28+Ipj6JI0nGwCHeCkaZN4enAPe/bub3cpktRxsgr0S87uA+ClPfvaXIkkdZ6sAv3MvukA3PaTDW2uRJI6T1aB/stvrh2hv+KQiyQdI6tAn9LTTV/vFP7mX5/l1X2GuiQNlVWgA1z0ppMBvzEqSUcrDfSIOC0iHoqIxyNidUR8qmifFREPRMTa4vfM8S8XrjhnHgB/8t2nWrE5ScpGI0fo+4E/SCktAS4EPhkRS4DrgAdTSmcBDxa3x92Fb6wdoT+7fU8rNidJ2SgN9JTS5pTST4vl3cATwHzgKuCWYrVbgA+MU41HmHHCJN6/9FQe27STex/b3IpNSlIWjmsMPSIWAucCy4G5KaVDiboFmDvCY66NiIGIGBgcHBxLrYf92rnzAfju6i1NeT5JqoKGAz0ipgN3Ap9OKe0ael+qnax82CtPpJRuTCn1p5T6+/r6xlTsIZcunsPiU3r5zqoXWPGM53aRJGgw0CNiErUwvzWl9O2ieWtEzCvunwdsG58Sh/f+pacCcNfDm1q5WUnqWI3McgngJuCJlNLXhtx1D7CsWF4G3N388kb2yUvPZOHJ01i+/kV+8GRL/y+RpI7UyBH6RcBvAu+MiFXFz5XAl4HLI2It8K7idktddOZsNrz8Kv/noadbvWlJ6jg9ZSuklP4fECPcfVlzyzk+X/rgOWzbvZdHNuzgtp88z0f6T6P2hkKSJp7svil6tCXz3sC23Xv57J2P8cLO19pdjiS1TfaB/vuXn82f/YdzAbhz5UbWbNlV8ghJqqbsAx3gjFknAvC1B57iD25/pM3VSFJ7VCLQz1kwg4HPv4srfvEUtu56jR89NcimHV5MWtLEUolA
B5g9fQoLZ5/I9lf28dGbV/Cfbhlod0mS1FKVCXSAT112Fnd+/Je4fMlctu56jSe37Gb94CvUvsgqSdVWqUCfOqmb88+Yydlzp/Pinn28509/xDu/+kP+8VFP4iWp+krnoefo2kvexDnzZ/Da6wf59G2reGZwDzte3UcQzJg2qd3lSdK4iFYOR/T396eBgdaNbaeUOPvz9/H6gX/v4/VXLOY///KbWlaDJI1VRKxMKfWXrVfJI/RDIoIbP9p/+GIYNzzwFOsHvTCGpGqqdKADXPrmOfDm2vKty5/nH1Zu4DuramdojIAvvv8XuPqC09tYoSQ1R+UDfajPXfkWfvzMi4dv/92/PccjG3dy9QVtLEqSmmRCBfqli+dw6eI5h28/sHord6/axL+s2364racr+J+/ds7ha5dKUi4mVKAf7ROXnnlEmKeU+M6qFxh49iUDXVJ2Kj3LZTTO/tx9zJ4+mQUzpx3R3tUF/+U9izn/jJltqkzSRNXoLJdKfbGoGX7rooWccfKJdHfFET8/Xv8SP/TKSJI62IQechnOf7vyLcO2n/OF+/mnxzazfvvw0x7n9E7l8+97C11dXmBDUnsY6A1631vnseLZl3h887HnW9/92n4Gd+/lYxct5LRZ04Z5tCSNPwO9QV/+0FtHvO/exzbziVt/yg3fe4qZ0ybXfZ63LpjBVW+b3+zyJMlAb4az5/Yye/oUvrt6a9319u4/QO/USQa6pHFhoDfBmXOmM/D5d5Wu9+X71vDN/7uev/rR+oafu6sr+NWlp9LXO2UsJUqaAAz0FjprznT2H0x86d4njutxr+7dz+9edtY4VSWpKgz0FvrQ+Qu44pxTOHgcU//f/kff4+ENO7i7OP/MaJw+axrnnu78eanqDPQWmzb5+P7k82eewPfXbOP7a0Y/B35KTxdr/sd7iXBKpVRlBnqHu+sTv8S23XtH/fjbBzbwjR+u50drtzOlp3nfI+vuCpYuOInJTXxOSWNjoHe43qmT6J06+qssLT6lF4BlN69oVkmH/fdfWcJvX7yo6c8raXQM9Ip7/1tP5bSZ09h34GBTn3fZzStYu+2VwxcPaYepk7o5ZcbUtm1f6jQGesX1dHfRv3BW05931omT+daK5/nWiueb/tzH486Pv4Pzz2h+/6QcGegalZuWvZ2123a3bfvbdu3lf923hme2v8qSeTPaVsfRpk7q8sNntY2nz1WWtu1+jQu+9GC7yzjGh85bwFc/srTdZahivEi0Km1O71Ru+PWlbN01+hlAzXbHyo08tbV971okA13Z+uC5C9pdwhEef2EX//joC5zzxfvbXcqE9JnLz+ZjF03sWVcGutQk11y8iJOn1z/bpsbHXQ9vYuC5lw30sTw4It4LfB3oBr6ZUvpyU6qSMrT0tJNYetpJ7S5jQlr53Mv8YM02Lv/aD9tdyohuWvZ2Tj95fK+XMOpAj4hu4M+By4GNwE8i4p6U0uPNKk6SGnHNxYu4f/WWdpdRVyu+VT2WI/QLgHUppfUAEfH3wFWAgS6ppa5623yvM8DYLhI9H9gw5PbGou0IEXFtRAxExMDg4OAYNidJqmfc3wOklG5MKfWnlPr7+vrGe3OSNGGNJdA3AacNub2gaJMktcFYAv0nwFkRsSgiJgNXA/c0pyxJ0vEa9YeiKaX9EfE7wP3Upi3enFJa3bTKJEnHZUzz0FNK9wL3NqkWSdIYeLkZSaoIA12SKqKlp8+NiEHguVE+fDawvYnl5MA+Twz2ufrG2t8zUkql875bGuhjEREDjZwPuErs88Rgn6uvVf11yEWSKsJAl6SKyCnQb2x3AW1gnycG+1x9LelvNmPokqT6cjpClyTVYaBLUkVkEegR8d6IeDIi1kXEde2u53hFxLMR8VhErIqIgaJtVkQ8EBFri98zi/aIiP9d9PXRiDhvyPMsK9ZfGxHLhrSfXzz/uuKx0YY+3hwR2yLiZ0Paxr2PI22jjX3+YkRsKvb1qoi4csh91xf1PxkR
7xnSPuzruzjx3fKi/bbiJHhExJTi9rri/oUt6u9pEfFQRDweEasj4lNFe2X3c50+d+Z+Til19A+1E389DbwRmAw8Aixpd13H2YdngdlHtf0xcF2xfB3wlWL5SuA+IIALgeVF+yxgffF7ZrE8s7hvRbFuFI+9og19vAQ4D/hZK/s40jba2OcvAn84zLpLitfuFGBR8Zrurvf6Bm4Hri6W/xL4eLH8CeAvi+Wrgdta1N95wHnFci/wVNGvyu7nOn3uyP3c0n/0o/yDvgO4f8jt64Hr213XcfbhWY4N9CeBeUNeNE8Wy98AfuPo9YDfAL4xpP0bRds8YM2Q9iPWa3E/F3JkuI17H0faRhv7PNI/9CNet9TOUvqOkV7fRaBtB3qK9sPrHXpssdxTrBdt2N93U7umcOX38zB97sj9nMOQS0OXuutwCfhuRKyMiGuLtrkppc3F8hZgbrE8Un/rtW8cpr0TtKKPI22jnX6nGGK4ecjQwPH2+WRgR0pp/1HtRzxXcf/OYv2WKd7+nwssZ4Ls56P6DB24n3MI9Cq4OKV0HnAF8MmIuGTonan2X3Cl54+2oo8d8nf8C+BNwNuAzcBX21rNOIiI6cCdwKdTSruG3lfV/TxMnztyP+cQ6Nlf6i6ltKn4vQ24C7gA2BoR8wCK39uK1Ufqb732BcO0d4JW9HGkbbRFSmlrSulASukg8FfU9jUcf59fBE6KiJ6j2o94ruL+GcX64y4iJlELtltTSt8umiu9n4frc6fu5xwCPetL3UXEiRHRe2gZeDfwM2p9OPTp/jJqY3MU7R8tZghcCOws3mreD7w7ImYWb+/eTW2sbTOwKyIuLGYEfHTIc7VbK/o40jba4lDoFD5IbV9Drc6ri5kLi4CzqH0AOOzruzgKfQj4cPH4o/9+h/r8YeD7xfrjqvjb3wQ8kVL62pC7KrufR+pzx+7ndnywMIoPIq6k9uny08Dn2l3Pcdb+RmqfaD8CrD5UP7WxsAeBtcD3gFlFewB/XvT1MaB/yHP9NrCu+PnYkPb+4gX1NPBntOcDsm9Re+v5OrVxwGta0ceRttHGPv9t0adHi3+Q84as/7mi/icZMhNppNd38dpZUfwt/gGYUrRPLW6vK+5/Y4v6ezG1oY5HgVXFz5VV3s91+tyR+9mv/ktSReQw5CJJaoCBLkkVYaBLUkUY6JJUEQa6JFWEgS5JFWGgS1JF/H85cMkmMcaqfgAAAABJRU5ErkJggg==\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
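- "The plot above shows that a small fraction of users read an extremely wide range of categories, while most users stay below 20 news categories."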
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 49,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " category_id \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 250000.000000 \n",
- " 250000.000000 \n",
- " \n",
- " \n",
- " mean \n",
- " 124999.500000 \n",
- " 4.573188 \n",
- " \n",
- " \n",
- " std \n",
- " 72168.927986 \n",
- " 4.419800 \n",
- " \n",
- " \n",
- " min \n",
- " 0.000000 \n",
- " 1.000000 \n",
- " \n",
- " \n",
- " 25% \n",
- " 62499.750000 \n",
- " 2.000000 \n",
- " \n",
- " \n",
- " 50% \n",
- " 124999.500000 \n",
- " 3.000000 \n",
- " \n",
- " \n",
- " 75% \n",
- " 187499.250000 \n",
- " 6.000000 \n",
- " \n",
- " \n",
- " max \n",
- " 249999.000000 \n",
- " 95.000000 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "# Top 50 users by click count\n",
+ "plt.plot(user_click_item_count[:50])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
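+ "The top 50 users by click count each have more than 100 clicks. Idea: we can define users with at least 100 clicks as active users. This is a simple heuristic; a more complete measure of user activity also takes click time into account, and later we judge user activity from both click count and click time."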
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 35,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXEAAAD4CAYAAAAaT9YAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAARV0lEQVR4nO3dfYxc1X3G8eexd7ExEDAYjEPYrkOQFZekKUxT2lKgJQHHSuWGphJIDaRYWaUBKUitKJQqRWlTNYnaSFWippvaMonASZsUGSVtg4tSXKkYYqd+WQqYlwLxSzAvcYgIBYxP/5i7u6Nl786dmTt7z5n7/UjWzt6Z3fmdnfGjM+ece65DCAIApGlB1QUAALpHiANAwghxAEgYIQ4ACSPEASBhQ/P5ZMuWLQujo6Pz+ZQAkLydO3c+H0I4fbb75jXER0dHtWPHjvl8SgBInu2n8+5jOAUAEkaIA0DCCHEASBghDgAJI8QBIGFtQ9z2RtuHbU/Mct8f2g62l/WnPADAXIr0xDdJWjPzoO2zJV0u6ZmSawIAFNR2nXgIYZvt0Vnu+oKkmyRtKbuome59+Fnt/uGRfj/Nm5yy5Dh99FdHtWCB5/25AaCIrk72sb1O0oEQwm577oCzPSZpTJJGRka6eTrdt+85fW177lr3vpjcZv2SVafrnNNPnNfnBoCiOg5x20sk/YmaQylthRDGJY1LUqPR6OoKFJ9ed54+ve68bn60a9/ec1A33PnfeuMYF80AEK9uVqecI2mlpN22n5L0Nkk/sH1mmYXFggsfAYhZxz3xEMJeSWdMfp8FeSOE8HyJdVXOYhwcQPyKLDHcLOl+Sats77e9vv9lVW9yqD+IrjiAeBVZnXJ1m/tHS6smIpP9cIZTAMSMMzZzTPXECXEAESPE22A4BUDMCPFcTGwCiB8hnoPhFAApIMRz0A8HkAJCvA164gBiRojnaLcnDADEgBDPMbVOnNUpACJGiOdgYhNACgjxHNOn3QNAvAhxAEgYIZ5jchfDwHgKgIgR4nkYTgGQAEI8B7sYAkgBIZ5jep04KQ4gXoQ4ACSMEM/BcAqAFBDiOVgnDiAFhHiO6SWGFRcCAHMgxHOw/xWAFBDibXCyD4CYEeI5WGAIIAWEeB52MQSQAEI8x9TEJn1xABEjxHMwsQkgBYR4O3TEAUSMEM/BxCaAFBDiOSY3wGJiE0DMCPEcjIkDSAEhnoOr3QNIASHeBsMpAGLWNsRtb7R92PZEy7E/t73H9i7b99h+a3/LnH/sYgggBUV64pskrZlx7PMhhHeHEN4j6duSPlVyXRFgUBxA/IbaPSCEsM326IxjL7V8e4IGsMM6tKAZ4us3fV8LIp/lPGHRQm25/iKNnLak6lIAzLO2IZ7H9mckXSPpJ5J+Y47HjUkak6SRkZFun27erX7rW3TTmlX66f8drbqUOR088oq27DqoA0deIcSBGuo6xEMIt0q61fYtkm6Q9Gc5jxuXNC5JjUYjmR778MIF+sSl76i6jLa2P/mCtuw6yCoaoKbKWJ1yh6TfKeH3oAtTAz1kOFBLXYW47XNbvl0n6ZFyykG3yHCgntoOp9jeLOlSScts71dz2GSt7VWSjkl6WtLH+1kk8rE9AFBvRVanXD3L4Q19qAVdiHzhDIA+44zNxLE9AFBvhHjizGXkgFojxAcEGQ7UEyGevMmJTWIcqCNCPHFMbAL1RognjsvIAfVGiCfO7JkL1BohPiBYYgjUEyGeuKnhFDIcqCVCPHFMbAL1RognzmLvFKDOCPHEMa8J1BshPiA42QeoJ0J8QBDhQD0R4oljYhOoN0I8cUxsAvVGiCfOXGQTqDVCfEDQEwfqiRBPHEsMgXojxBNnMbMJ1BkhnjguzwbUGyGeOC6UDNQbIT4g6IkD9USIJ46JTaDeCPHkMbEJ1BkhnrjpiU364kAdEeKJox8O1BshPiDoiAP1RIgnbvJq9ywxBOqJEE8cwylAvbUNcdsb
bR+2PdFy7PO2H7G9x/Zdtk/pa5XIxRmbQL0V6YlvkrRmxrGtks4LIbxb0j5Jt5RcFwpiP3Gg3obaPSCEsM326Ixj97R8u13Sh0uuCx26b99zOvLK61WX0bMVJy/W2netqLoMIBltQ7yA6yR9I+9O22OSxiRpZGSkhKdDq5OXDOvERUO6e/dB3b37YNXllGLvbZfrpMXDVZcBJKGnELd9q6Sjku7Ie0wIYVzSuCQ1Gg0+9Jfs5OOHteNP36dXjx6rupSe3fnAM/rsvz2io2/wNgGK6jrEbX9U0gclXRY4XbBSi4cXavHwwqrL6Nnxw80pGt5MQHFdhbjtNZJuknRJCOFn5ZaEurJZMAl0qsgSw82S7pe0yvZ+2+slfVHSSZK22t5l+8t9rhM1wD4wQOeKrE65epbDG/pQC2pu+gIXAIrijE1Eh444UBwhjniwDwzQMUIc0WBaE+gcIY5omEFxoGOEOKJDhgPFEeKIBpt5AZ0jxBGNqXXi9MWBwghxRIOJTaBzhDiiwQUugM4R4ogOGQ4UR4gjGtMTm8Q4UBQhjngwnAJ0jBBHNJjYBDpHiANAwghxRGPyohAMpwDFEeKIxvTWKaQ4UBQhjmiwThzoHCEOAAkjxBGN6b1TABRFiCManOwDdI4QRzToiQOdI8QRHTriQHGEOAAkjBBHNMxFNoGOEeKIxlSEk+FAYYQ4osHEJtA5QhzRoScOFEeIIxpmM1qgY4Q4osHV7oHOEeKIBhObQOcIcUSDXQyBzrUNcdsbbR+2PdFy7HdtP2T7mO1Gf0tE3TCcAhRXpCe+SdKaGccmJF0paVvZBaHOmNgEOjXU7gEhhG22R2cce1hqPcMO6N2C7O30e//wgIYWDv5I37lnnKg7P3Zh1WUgcW1DvFe2xySNSdLIyEi/nw4J++WVp+m6X1upV15/o+pS+m7vgSP6rydeqLoMDIC+h3gIYVzSuCQ1Gg0GO5Hr5CXD+tRvra66jHnxha37NHHgparLwAAY/M+sQISmV+LQr0FvCHGgQmQ4elVkieFmSfdLWmV7v+31tj9ke7+kX5H0Hdvf7XehwCCZuhRdxXUgfUVWp1ydc9ddJdcC1AYLu1AWhlOACkxvMUBfHL0hxIEKsHc6ykKIAxWiI45eEeJABSbPdmafGPSKEAeAhBHiQAXYdhdlIcSBCnApOpSFEAcqRE8cvSLEgQpwPVGUhRAHKsBgCspCiAMVYGITZSHEgQqwARbKQogDFWLvFPSKEAcqwN4pKAshDgAJI8SBCkztnUJXHD0ixIEKTC0xJMTRI0IcqBAn+6BXhDhQAdaJoyyEOFABzthEWQhxoALTF4UAekOIAxWYHk4hxtEbQhyoEBGOXhHiQAUmx8TpiKNXhDhQBTO1iXIQ4kAFpnriDKigR4Q4UAFPpzjQE0IcqBAZjl4R4kAFpi4KQYqjR4Q4UAHmNVGWtiFue6Ptw7YnWo6danur7ceyr0v7WyYwWJjYRFmK9MQ3SVoz49jNku4NIZwr6d7sewAFsQEWyjLU7gEhhG22R2ccXifp0uz27ZL+Q9Ifl1kYUAf/sveQli45ruoyorRoeIHev3q5Fg0trLqUqLUN8RzLQwiHsts/krQ874G2xySNSdLIyEiXTwcMlmUnLpIk/cV3Hq64krj9/Ucu0BU/f2bVZUSt2xCfEkIItnM/FIYQxiWNS1Kj0eDDIyDpsncu1/ZbLtNrR49VXUqUnn7xZX1kw4N6lb9PW92G+LO2V4QQDtleIelwmUUBdXDmyYurLiFar73RDG92eWyv2yWGd0u6Nrt9raQt5ZQDACzB7ESRJYabJd0vaZXt/bbXS/orSe+3/Zik92XfA0ApyPDiiqxOuTrnrstKrgUAJLVc+YjRlLY4YxNAtDgZqj1CHEB0uGhGcYQ4gOgwsVkcIQ4gOuzyWBwhDiA6U3vLVFtGEghxANHiZJ/2CHEA0SLC2yPEAUSHic3iCHEA0TGD4oUR4gCiw5WPiiPEAUSLec32
CHEA0WE0pThCHEB0zD6GhRHiAKLDhaSLI8QBRIeJzeIIcQDRoifeHiEOID5MbBZGiAOIDhObxRHiAKJjrgpRGCEOIDrTE5tohxAHEC064u0R4gCiM321e1K8HUIcQHSY1iyOEAcQHfZOKY4QBxAdLpRcHCEOIFpkeHuEOID4TG2ARYy3Q4gDiA7X2CyOEAcQHTK8OEIcQHSm14lXXEgCCHEA0WI/8fZ6CnHbn7Q9Yfsh2zeWVBOAmmP/q+K6DnHb50n6mKT3SvoFSR+0/Y6yCgNQX0xsFjfUw8++U9IDIYSfSZLt+yRdKelzZRQGoL4mT/b5yn8+qW/u3F9xNeX4yyvfpV8aPbX039tLiE9I+ozt0yS9ImmtpB0zH2R7TNKYJI2MjPTwdADqYvHwAn38knP0zIsvV11KaY4fXtiX3+teFtPbXi/pE5JelvSQpFdDCDfmPb7RaIQdO96U8wCAOdjeGUJozHZfTxObIYQNIYQLQggXS/qxpH29/D4AQGd6GU6R7TNCCIdtj6g5Hn5hOWUBAIroKcQlfSsbE39d0vUhhCO9lwQAKKqnEA8h/HpZhQAAOscZmwCQMEIcABJGiANAwghxAEhYTyf7dPxk9nOSnu7yx5dJer7EclJAm+uBNtdDL23+uRDC6bPdMa8h3gvbO/LOWBpUtLkeaHM99KvNDKcAQMIIcQBIWEohPl51ARWgzfVAm+uhL21OZkwcAPBmKfXEAQAzEOIAkLAkQtz2GtuP2n7c9s1V19ML20/Z3mt7l+0d2bFTbW+1/Vj2dWl23Lb/Nmv3Htvnt/yea7PHP2b72qraMxvbG20ftj3Rcqy0Ntq+IPsbPp79bOVXZMxp8222D2Sv9S7ba1vuuyWr/1HbV7Qcn/W9bnul7Qey49+wfdz8tW52ts+2/T3b/5NdLP2T2fGBfa3naHN1r3UIIep/khZKekLS2yUdJ2m3pNVV19VDe56StGzGsc9Jujm7fbOkz2a310r6VzUv/n2hmtc0laRTJT2ZfV2a3V5addta2nOxpPMlTfSjjZIezB7r7Gc/EGmbb5P0R7M8dnX2Pl4kaWX2/l4413td0j9Kuiq7/WVJfxBBm1dIOj+7fZKaF4VZPciv9Rxtruy1TqEn/l5Jj4cQngwhvCbp65LWVVxT2dZJuj27fbuk3245/tXQtF3SKbZXSLpC0tYQwoshhB9L2ippzTzXnCuEsE3SizMOl9LG7L63hBC2h+a7/Kstv6syOW3Os07S10MIr4YQ/lfS42q+z2d9r2e9z9+U9M3s51v/fpUJIRwKIfwgu/1TSQ9LOksD/FrP0eY8fX+tUwjxsyT9sOX7/Zr7jxa7IOke2zvdvIi0JC0PIRzKbv9I0vLsdl7bU/yblNXGs7LbM4/H6oZs6GDj5LCCOm/zaZKOhBCOzjgeDdujkn5R0gOqyWs9o81SRa91CiE+aC4KIZwv6QOSrrd9ceudWY9joNd91qGNmb+TdI6k90g6JOmvK62mT2yfKOlbkm4MIbzUet+gvtaztLmy1zqFED8g6eyW79+WHUtSCOFA9vWwpLvU/Fj1bPbRUdnXw9nD89qe4t+krDYeyG7PPB6dEMKzIYQ3QgjHJH1Fzdda6rzNL6g59DA043jlbA+rGWZ3hBD+OTs80K/1bG2u8rVOIcS/L+ncbMb2OElXSbq74pq6YvsE2ydN3pZ0uaQJNdszOSN/raQt2e27JV2TzepfKOkn2cfU70q63PbS7GPb5dmxmJXSxuy+l2xfmI0fXtPyu6IyGWSZD6n5WkvNNl9le5HtlZLOVXMCb9b3etab/Z6kD2c/3/r3q0z2998g6eEQwt+03DWwr3Vemyt9rauc6S36T81Z7X1qzubeWnU9PbTj7WrOQu+W9NBkW9QcB7tX0mOS/l3SqdlxS/pS1u69khotv+s6NSdJHpf0+1W3bUY7N6v5kfJ1Ncf01pfZRkmN7D/JE5K+qOzM4wjb/LWsTXuy/8wrWh5/a1b/o2pZ
cZH3Xs/eOw9mf4t/krQogjZfpOZQyR5Ju7J/awf5tZ6jzZW91px2DwAJS2E4BQCQgxAHgIQR4gCQMEIcABJGiANAwghxAEgYIQ4ACft/AbwTsfQSxAYAAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id category_id\n",
- "count 250000.000000 250000.000000\n",
- "mean 124999.500000 4.573188\n",
- "std 72168.927986 4.419800\n",
- "min 0.000000 1.000000\n",
- "25% 62499.750000 2.000000\n",
- "50% 124999.500000 3.000000\n",
- "75% 187499.250000 6.000000\n",
- "max 249999.000000 95.000000"
- ]
- },
- "execution_count": 49,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
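- "### Distribution of the length of articles users read\n",
- "\n",
- "The average word count of the news each user clicks indicates whether that user prefers long or short articles."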
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 50,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 50,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
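- "The plot above shows that a small fraction of users read articles with a very high average word count, and a small fraction read articles with a very low average word count.\n",
- "\n",
- "Most users prefer news articles of 200 to 400 words."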
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 51,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 51,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "# Zoom in on the range that covers most users\n",
- "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Most users read articles of fewer than 250 words."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 52,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " count \n",
- " 250000.000000 \n",
- " 250000.000000 \n",
- " \n",
- " \n",
- " mean \n",
- " 124999.500000 \n",
- " 205.830189 \n",
- " \n",
- " \n",
- " std \n",
- " 72168.927986 \n",
- " 47.174030 \n",
- " \n",
- " \n",
- " min \n",
- " 0.000000 \n",
- " 8.000000 \n",
- " \n",
- " \n",
- " 25% \n",
- " 62499.750000 \n",
- " 187.500000 \n",
- " \n",
- " \n",
- " 50% \n",
- " 124999.500000 \n",
- " 202.000000 \n",
- " \n",
- " \n",
- " 75% \n",
- " 187499.250000 \n",
- " 217.750000 \n",
- " \n",
- " \n",
- " max \n",
- " 249999.000000 \n",
- " 3434.500000 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "# Users ranked in [25000:50000] by click count\n",
+ "plt.plot(user_click_item_count[25000:50000])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A very large number of users have two or fewer clicks; these users can be treated as inactive."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Analysis of news click counts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:42:14.526476Z",
+ "start_time": "2020-11-13T15:42:14.463642Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'].count(), reverse=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T15:42:16.198000Z",
+ "start_time": "2020-11-13T15:42:16.044455Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 37,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id words_count\n",
- "count 250000.000000 250000.000000\n",
- "mean 124999.500000 205.830189\n",
- "std 72168.927986 47.174030\n",
- "min 0.000000 8.000000\n",
- "25% 62499.750000 187.500000\n",
- "50% 124999.500000 202.000000\n",
- "75% 187499.250000 217.750000\n",
- "max 249999.000000 3434.500000"
- ]
- },
- "execution_count": 52,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "# More detailed statistics\n",
- "user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Analysis of when users click news"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 53,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Normalize the timestamps for easier visualization\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "mm = MinMaxScaler()\n",
- "user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])\n",
- "user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])\n",
- "\n",
- "user_click_merge = user_click_merge.sort_values('click_timestamp')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 54,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " rank \n",
- " click_cnts \n",
- " category_id \n",
- " created_at_ts \n",
- " words_count \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 18 \n",
- " 249990 \n",
- " 162300 \n",
- " 0.000000 \n",
- " 4 \n",
- " 3 \n",
- " 20 \n",
- " 1 \n",
- " 25 \n",
- " 2 \n",
- " 5 \n",
- " 5 \n",
- " 281 \n",
- " 0.989186 \n",
- " 193 \n",
- " \n",
- " \n",
- " 2 \n",
- " 249998 \n",
- " 160974 \n",
- " 0.000002 \n",
- " 4 \n",
- " 1 \n",
- " 12 \n",
- " 1 \n",
- " 13 \n",
- " 2 \n",
- " 5 \n",
- " 5 \n",
- " 281 \n",
- " 0.989092 \n",
- " 259 \n",
- " \n",
- " \n",
- " 30 \n",
- " 249985 \n",
- " 160974 \n",
- " 0.000003 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 8 \n",
- " 2 \n",
- " 8 \n",
- " 8 \n",
- " 281 \n",
- " 0.989092 \n",
- " 259 \n",
- " \n",
- " \n",
- " 50 \n",
- " 249979 \n",
- " 162300 \n",
- " 0.000004 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 25 \n",
- " 2 \n",
- " 2 \n",
- " 2 \n",
- " 281 \n",
- " 0.989186 \n",
- " 193 \n",
- " \n",
- " \n",
- " 25 \n",
- " 249988 \n",
- " 160974 \n",
- " 0.000004 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 21 \n",
- " 2 \n",
- " 17 \n",
- " 17 \n",
- " 281 \n",
- " 0.989092 \n",
- " 259 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "plt.plot(item_click_count)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 38,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id click_article_id click_timestamp click_environment \\\n",
- "18 249990 162300 0.000000 4 \n",
- "2 249998 160974 0.000002 4 \n",
- "30 249985 160974 0.000003 4 \n",
- "50 249979 162300 0.000004 4 \n",
- "25 249988 160974 0.000004 4 \n",
- "\n",
- " click_deviceGroup click_os click_country click_region \\\n",
- "18 3 20 1 25 \n",
- "2 1 12 1 13 \n",
- "30 1 17 1 8 \n",
- "50 1 17 1 25 \n",
- "25 1 17 1 21 \n",
- "\n",
- " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
- "18 2 5 5 281 0.989186 \n",
- "2 2 5 5 281 0.989092 \n",
- "30 2 8 8 281 0.989092 \n",
- "50 2 2 2 281 0.989186 \n",
- "25 2 17 17 281 0.989092 \n",
- "\n",
- " words_count \n",
- "18 193 \n",
- "2 259 \n",
- "30 259 \n",
- "50 193 \n",
- "25 259 "
- ]
- },
- "execution_count": 54,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "user_click_merge.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 55,
- "metadata": {},
- "outputs": [],
- "source": [
- "def mean_diff_time_func(df, col):\n",
- " df = pd.DataFrame(df, columns={col})\n",
- " df['time_shift1'] = df[col].shift(1).fillna(0)\n",
- " df['diff_time'] = abs(df[col] - df['time_shift1'])\n",
- " return df['diff_time'].mean()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Mean gap between consecutive click times\n",
- "mean_diff_click_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 57,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(mean_diff_click_time.values, reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The plot above shows that the time gap between clicks varies across users."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {},
- "outputs": [],
- "source": [
- "# Mean gap between the creation times of consecutively clicked articles\n",
- "mean_diff_created_time = user_click_merge.groupby('user_id')['click_timestamp', 'created_at_ts'].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/plain": [
- "[]"
- ]
- },
- "execution_count": 59,
- "metadata": {},
- "output_type": "execute_result"
- },
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
- }
- ],
- "source": [
- "plt.plot(sorted(mean_diff_created_time.values, reverse=True))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "The plot shows that the creation times of articles a user clicks consecutively also differ."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {},
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Defaulting to user installation because normal site-packages is not writeable\n",
- "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
- "Collecting gensim\n",
- " Downloading https://mirrors.aliyun.com/pypi/packages/2b/e0/fa6326251692056dc880a64eb22117e03269906ba55a6864864d24ec8b4e/gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)\n",
- "\u001b[K |████████████████████████████████| 24.2 MB 91.0 MB/s eta 0:00:01\n",
- "\u001b[?25hRequirement already satisfied: six>=1.5.0 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.15.0)\n",
- "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
- "Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.5.4)\n",
- "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
- "Collecting smart-open>=1.8.1\n",
- " Downloading https://mirrors.aliyun.com/pypi/packages/e3/cf/6311dfb0aff3e295d63930dea72e3029800242cdfe0790478e33eccee2ab/smart_open-4.0.1.tar.gz (117 kB)\n",
- "\u001b[K |████████████████████████████████| 117 kB 96.7 MB/s eta 0:00:01\n",
- "\u001b[?25hBuilding wheels for collected packages: smart-open\n",
- " Building wheel for smart-open (setup.py) ... \u001b[?25ldone\n",
- "\u001b[?25h Created wheel for smart-open: filename=smart_open-4.0.1-py3-none-any.whl size=108249 sha256=50eb67320a58790e8b173971aeb6af7b636d48259d7c9de759612e58e334215b\n",
- " Stored in directory: /home/admin/.cache/pip/wheels/c3/14/fc/a0e523e5d2f13d083ce0af09d4e2861d8e2ec65fc466fb1dff\n",
- "Successfully built smart-open\n",
- "Installing collected packages: smart-open, gensim\n",
- "Successfully installed gensim-3.8.3 smart-open-4.0.1\n"
- ]
- }
- ],
- "source": [
- "# Install gensim\n",
- "!pip install gensim"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 44,
- "metadata": {},
- "outputs": [],
- "source": [
- "from gensim.models import Word2Vec\n",
- "import logging, pickle\n",
- "\n",
- "# Note: the model here is trained for only one iteration\n",
- "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " # IDs must be converted to strings before training\n",
- " click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
- " # Build one 'sentence' of article IDs per user\n",
- " docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
- " docs = docs['click_article_id'].values.tolist()\n",
- "\n",
- " # Configure logging so that training progress is visible\n",
- " logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
- "\n",
- " # These parameters strongly affect the learned vectors; negative sampling defaults to 5\n",
- " w2v = Word2Vec(docs, size=16, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=10)\n",
- " \n",
- " # Store the embeddings as a dict\n",
- " item_w2v_emb_dict = {k: w2v[k] for k in click_df['click_article_id']}\n",
- " \n",
- " return item_w2v_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 45,
- "metadata": {},
- "outputs": [],
- "source": [
- "item_w2v_emb_dict = trian_item_word2vec(user_click_merge)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "text/html": [
- "\n",
- "\n",
- "
\n",
- " \n",
- " \n",
- " \n",
- " user_id \n",
- " click_article_id \n",
- " click_timestamp \n",
- " click_environment \n",
- " click_deviceGroup \n",
- " click_os \n",
- " click_country \n",
- " click_region \n",
- " click_referrer_type \n",
- " \n",
- " \n",
- " \n",
- " \n",
- " 25667 \n",
- " 190841 \n",
- " 199197 \n",
- " 1507045276129 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 20 \n",
- " 2 \n",
- " \n",
- " \n",
- " 25668 \n",
- " 190841 \n",
- " 285298 \n",
- " 1507045302920 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 20 \n",
- " 2 \n",
- " \n",
- " \n",
- " 25669 \n",
- " 190841 \n",
- " 156624 \n",
- " 1507046638885 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 20 \n",
- " 2 \n",
- " \n",
- " \n",
- " 25670 \n",
- " 190841 \n",
- " 129029 \n",
- " 1507046668885 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 20 \n",
- " 2 \n",
- " \n",
- " \n",
- " 107739 \n",
- " 164226 \n",
- " 214800 \n",
- " 1507131402464 \n",
- " 4 \n",
- " 1 \n",
- " 17 \n",
- " 1 \n",
- " 21 \n",
- " 2 \n",
- " \n",
- " \n",
- "
\n",
- "
"
+ "source": [
+ "plt.plot(item_click_count[:100])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The 100 most-clicked news articles each have more than 1000 clicks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(item_click_count[:20])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
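+ "The 20 most-clicked news articles each have more than 2500 clicks. Idea: these can be defined as hot news. This is again a simple heuristic; later we grade article popularity by both click count and time."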
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 40,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(item_click_count[3500:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Many news articles were clicked only once or twice. Idea: these can be defined as cold (unpopular) news."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News co-occurrence frequency: how often two articles are clicked consecutively"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " 433597.000000 \n",
+ " \n",
+ " \n",
+ " mean \n",
+ " 3.184139 \n",
+ " \n",
+ " \n",
+ " std \n",
+ " 18.851753 \n",
+ " \n",
+ " \n",
+ " min \n",
+ " 1.000000 \n",
+ " \n",
+ " \n",
+ " 25% \n",
+ " 1.000000 \n",
+ " \n",
+ " \n",
+ " 50% \n",
+ " 1.000000 \n",
+ " \n",
+ " \n",
+ " 75% \n",
+ " 2.000000 \n",
+ " \n",
+ " \n",
+ " max \n",
+ " 2202.000000 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " count\n",
+ "count 433597.000000\n",
+ "mean 3.184139\n",
+ "std 18.851753\n",
+ "min 1.000000\n",
+ "25% 1.000000\n",
+ "50% 1.000000\n",
+ "75% 2.000000\n",
+ "max 2202.000000"
+ ]
+ },
+ "execution_count": 41,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "tmp = user_click_merge.sort_values('click_timestamp')\n",
+ "tmp['next_item'] = tmp.groupby(['user_id'])['click_article_id'].transform(lambda x:x.shift(-1))\n",
+ "union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].agg(['count']).reset_index().sort_values('count', ascending=False)\n",
+ "union_item[['count']].describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
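+ "The statistics show a mean co-occurrence count of 3.18 and a maximum of 2202.\n",
+ "\n",
+ "This suggests that the news articles a user reads are fairly strongly related."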
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ ""
+ ]
+ },
+ "execution_count": 42,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Plot it for a quick visual check\n",
+ "x = union_item['click_article_id']\n",
+ "y = union_item['count']\n",
+ "plt.scatter(x, y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 43,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 43,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXwAAAD4CAYAAADvsV2wAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAATdElEQVR4nO3df6xkZX3H8fe37Aq2EPmxN7pd9nKhmhgxuOB1hUANISHlV+CPYrqkRUTNNoopVlsrmiCamIhNlSpG3ApF1Cr4syuFWFqwahuW7OKy/BK9KgYQ3AVkkarU1W//mLMwd5hzZ+7MmTt3znm/ksmeOeeZOd89s/dzn32ec85EZiJJqr/fG3cBkqSlYeBLUkMY+JLUEAa+JDWEgS9JDbFiXDtetWpVzszMjGv3kjSRtm3b9mhmTg3y2rEF/szMDFu3bh3X7iVpIkXETwZ9rUM6ktQQBr4kNYSBL0kNYeBLUkMY+JLUEH0HfkTsExHfjYjru2zbNyKujYi5iNgSETOVVilJGtpievgXAveWbHsj8PPMfDHwEeDSYQuTJFWrr/PwI+JQ4HTgA8DbuzQ5C7ikWP4ScHlERI7g3sv3PfIL/m3HT0u3n/CSKdYffnDVu5WkidfvhVeXAe8EDijZvgZ4ACAz90TEbuAQ4NH2RhGxEdgIMD09PUC5MLfzKT52y1zXbZlw648f57q/PG6g95akOusZ+BFxBrAzM7dFxInD7CwzNwGbAGZnZwfq/Z9+1GpOP+r0rtv+/FO38vRvfjd4gZJUY/2M4R8PnBkR9wNfAE6KiM92tHkIWAsQESuAFwCPVVhn3/z+LknqrmfgZ+ZFmXloZs4AG4CbM/MvOpptBs4rls8u2pi9krSMDHzztIh4P7A1MzcDVwKfiYg54HFavxiWXBD4e0aSultU4GfmN4FvFssXt63/NfDaKguTJFWrVlfaRoy7AklavmoV+OCkrSSVqV3gS5K6q13gO2crSd3VLvAlSd3VKvAjwjF8SSpRq8CXJJWrVeB7VqYklatV4APO2kpSiVoFvhdeSVK5WgU+eOGVJJWpXeBLkrqrVeAHDuFLUplaBb4kqVytAj+ctZWkUrUKfIB02laSuqpV4Nu/l6RytQp8cNJWksrULvAlSd3VKvAj7OFLUplaBb4kqVzNAt9pW0kqU7PA9146klSmVoHvdVeSVK5n4EfEfhFxW0TcERF3R8T7urR5fUTsiojtxeNNoym3t3TWVpK6WtFHm6eBkzLzqYhYCXwnIm7MzFs72l2bmW+tvkRJUhV6Bn62usxPFU9XFo9l2Y12REeSyvU1hh8R+0TEdmAncFNmbunS7E8jYkdEfCki1pa8z8aI2BoRW3ft2jV41ZKkResr8DPzt5m5DjgUWB8RL+9o8nVgJjOPAm4CPl3yPpsyczYzZ6empoYouzsnbSWp3KLO0snMJ4BbgFM61j+WmU8XTz8FvLKS6gbgnK0kddfPWTpTEXFgsfx84GTgex1tVrc9PRO4t8Ia+xaO4ktSqX7O0lkNfDoi9qH1C+K6zLw+It4PbM3MzcBfRcSZwB7gceD1oyq4F++HL0nd9XOWzg7g6C7rL25bvgi4qNrSJElVqt2Vto7hS1J3tQp8SVK5WgW+p2VKUrlaBT4s00uAJWkZqFXge1qmJJWrVeCDd8uUpDK1C3xJUnf1CvxwDF+SytQr8CVJpWoV+E7ZSlK5WgU+4JiOJJWoVeCHV15JUqlaBT7YwZekMrULfElSd7UK/MALrySpTK0CX5JUrlaB75ytJJWrVeCDk7aSVKZWgW8HX5LK1Srwwa84lKQytQt8SVJ3tQr8iCAdxZekrmoV+JKkcrUKfCdtJalcz8CPiP0i4raIuCMi7o6I93Vps29EXBsRcxGxJSJmRlJtH5y0laTu+unhPw2clJmvANYBp0TEsR1t3gj8PDNfDHwEuLTSKvtlF1+SSq3o1SBbN6d5qni6snh09qPPAi4plr8E
XB4RkWO4sc3j//t//O0X7xj49RvWr+WVhx1cYUWStDz0DHyAiNgH2Aa8GPh4Zm7paLIGeAAgM/dExG7gEODRjvfZCGwEmJ6eHq7yLtbPHMytP3yM/557tHfjLh558tckGPiSaqmvwM/M3wLrIuJA4KsR8fLMvGuxO8vMTcAmgNnZ2cp7/xvWT7Nh/eC/SI7/4M0VViNJy8uiztLJzCeAW4BTOjY9BKwFiIgVwAuAxyqob8k56Suprvo5S2eq6NkTEc8HTga+19FsM3BesXw2cPM4xu8lSeX6GdJZDXy6GMf/PeC6zLw+It4PbM3MzcCVwGciYg54HNgwsopHzCt1JdVVP2fp7ACO7rL+4rblXwOvrbY0SVKVanWl7bD8AhVJdWbgd3JER1JNGfht7OFLqjMDv4MdfEl1ZeBLUkMY+G2CwMsHJNWVgS9JDWHgt3HSVlKdGfgdHNCRVFcGfhs7+JLqzMDv4JytpLoy8CWpIQz8NhHhGL6k2jLwJakhDPw2TtpKqjMDv4NX2kqqKwO/nV18STVm4Hewfy+prgx8SWoIA79NgF18SbVl4EtSQxj4bcLbZUqqMQO/QzqmI6mmDPw29u8l1VnPwI+ItRFxS0TcExF3R8SFXdqcGBG7I2J78bh4NOWOntddSaqrFX202QO8IzNvj4gDgG0RcVNm3tPR7tuZeUb1JUqSqtCzh5+ZD2fm7cXyL4B7gTWjLmwcIuzhS6qvRY3hR8QMcDSwpcvm4yLijoi4MSKOLHn9xojYGhFbd+3atfhqJUkD6zvwI2J/4MvA2zLzyY7NtwOHZeYrgI8BX+v2Hpm5KTNnM3N2ampqwJJHJ5y2lVRjfQV+RKykFfafy8yvdG7PzCcz86li+QZgZUSsqrTSJeJpmZLqqp+zdAK4Erg3Mz9c0uZFRTsiYn3xvo9VWehS8LorSXXWz1k6xwPnAndGxPZi3buBaYDMvAI4G3hzROwBfgVsyAm9sfxkVi1JvfUM/Mz8Dj2uScrMy4HLqypKklQ9r7TtYAdfUl0Z+JLUEAZ+G++WKanODPwOTtpKqisDv439e0l1ZuA/h118SfVk4EtSQxj4bbxbpqQ6M/AlqSEM/DYRjuBLqi8DX5IawsBv4/3wJdWZgd9hQm/yKUk9GfiS1BAGfhsnbSXVmYEvSQ1h4LcJvPBKUn0Z+JLUEAZ+O++HL6nGDPwOjuhIqisDX5IawsBv05q0tY8vqZ4MfElqCAO/jXO2kuqsZ+BHxNqIuCUi7omIuyPiwi5tIiI+GhFzEbEjIo4ZTbmSpEGt6KPNHuAdmXl7RBwAbIuImzLznrY2pwIvKR6vBj5R/DlR7OBLqrOegZ+ZDwMPF8u/iIh7gTVAe+CfBVyTrRnPWyPiwIhYXbx2otz50G7OvXLLuMt4jvOPn+Gkl75w3GVImmD99PCfEREzwNFAZyKuAR5oe/5gsW5e4EfERmAjwPT09CJLHb0zjvpDvr7jpzz19J5xlzLP3Q89ydQB+xr4kobSd+BHxP7Al4G3ZeaTg+wsMzcBmwBmZ2eX3fmPbzjhcN5wwuHjLuM5Trj05nGXIKkG+jpLJyJW0gr7z2XmV7o0eQhY2/b80GKdqrLsfj1KmjT9nKUTwJXAvZn54ZJmm4HXFWfrHAvsnsTxe0mqs36GdI4HzgXujIjtxbp3A9MAmXkFcANwGjAH/BI4v/JKG8wvZpFUhX7O0vkOPc5YLM7OuaCqoiRJ1fNK2wkQXiEgqQIG/oTwpm6ShmXgTwDv8SOpCgb+hLB/L2lYBr4kNYSBPwFaX8wy7iokTToDX5IawsCfABHhGL6koRn4ktQQBv4E8KxMSVUw8CeEF15JGpaBL0kNYeBPAu+WKakCBr4kNYSBPwEC7OJLGpqBL0kNYeBPgPB2mZIqYOBPiHRMR9KQDHxJaggDfwJ4t0xJVTDwJakhDPwJEGEPX9LwDHxJaggDfwKE98uUVIGegR8RV0XEzoi4q2T7iRGxOyK2F4+Lqy9T
npYpaVgr+mhzNXA5cM0Cbb6dmWdUUpEkaSR69vAz81vA40tQi0o4aSupClWN4R8XEXdExI0RcWRZo4jYGBFbI2Lrrl27Ktq1JKkfVQT+7cBhmfkK4GPA18oaZuamzJzNzNmpqakKdt0cdvAlDWvowM/MJzPzqWL5BmBlRKwaujJJUqWGDvyIeFEUt3OMiPXFez427PvqWd4tU1IVep6lExGfB04EVkXEg8B7gZUAmXkFcDbw5ojYA/wK2JB+43blPKKShtUz8DPznB7bL6d12qYkaRnzStsJ0BrQsYsvaTgGviQ1hIE/AbzwSlIVDHxJaggDfwJ4VqakKhj4E8IRHUnDMvAngPfDl1QFA39CeC2bpGEZ+JLUEAb+BIhwDF/S8Ax8SWoIA38COGUrqQoG/oRwzlbSsAz8SeCVV5IqYOBPCDv4koZl4EtSQxj4EyDwwitJwzPwJakhDPwJ4JytpCoY+JLUEAb+BLCDL6kKBv6EcM5W0rAMfElqCAN/AkQE6aVXkobUM/Aj4qqI2BkRd5Vsj4j4aETMRcSOiDim+jIlScPqp4d/NXDKAttPBV5SPDYCnxi+LLVz0lZSFVb0apCZ34qImQWanAVck61LQW+NiAMjYnVmPlxVkYLbf/IEJ3/4v8ZdhqQK/Nmr1vKmPz5iyffbM/D7sAZ4oO35g8W65wR+RGyk9b8ApqenK9h1M5x73GF84+5Hxl2GpIqs2n/fsey3isDvW2ZuAjYBzM7OOgvZp7PWreGsdWvGXYakCVfFWToPAWvbnh9arJMkLSNVBP5m4HXF2TrHArsdv5ek5afnkE5EfB44EVgVEQ8C7wVWAmTmFcANwGnAHPBL4PxRFStJGlw/Z+mc02N7AhdUVpEkaSS80laSGsLAl6SGMPAlqSEMfElqiBjXl2NHxC7gJwO+fBXwaIXlVMnaFm+51gXWNojlWhfUo7bDMnNqkB2MLfCHERFbM3N23HV0Y22Lt1zrAmsbxHKtC6zNIR1JaggDX5IaYlIDf9O4C1iAtS3ecq0LrG0Qy7UuaHhtEzmGL0lavEnt4UuSFsnAl6SmyMyJetD6ft37aN2d810j3M/9wJ3AdmBrse5g4CbgB8WfBxXrA/hoUdMO4Ji29zmvaP8D4Ly29a8s3n+ueG0sUMtVwE7grrZ1I6+lbB991HYJre9E2F48TmvbdlGxn/uAP+n1uQKHA1uK9dcCzyvW71s8nyu2z3TUtRa4BbgHuBu4cLkctwVqG+txA/YDbgPuKOp63xDvVUm9fdR2NfDjtmO2bkw/B/sA3wWuXy7HrGuWjCowR/EoDuoPgSOA5xUf/stGtK/7gVUd6z6094AD7wIuLZZPA24s/pEdC2xp+4fyo+LPg4rlvQFzW9E2iteeukAtrwGOYX6ojryWsn30UdslwN90afuy4jPbt/jH+sPiMy39XIHrgA3F8hXAm4vltwBXFMsbgGs79rWa4occOAD4frH/sR+3BWob63Er/h77F8sraYXJsYt9ryrr7aO2q4Gzuxyzpf45eDvwLzwb+GM/Zl2zZBRhOaoHcBzwjbbnFwEXjWhf9/PcwL8PWN32Q3tfsfxJ4JzOdsA5wCfb1n+yWLca+F7b+nntSuqZYX6ojryWsn30UdsldA+ueZ8X8I3iM+36uRY/eI8CKzo//72vLZZXFO0W+l/SvwInL6fj1qW2ZXPcgN8Hbgdevdj3qrLekuPVXtvVdA/8Jfs8aX3L338CJwHXD3L8R33M9j4mbQy/7AvTRyGBf4+IbcWXrwO8MJ/9Nq9HgBf2qGuh9Q92Wb8YS1FL2T768daI2BERV0XEQQPWdgjwRGbu6VLbM68ptu8u2j9HRMwAR9PqFS6r49ZRG4z5uEXEPhGxndYw3U20epeLfa8q620/VvNqy8y9x+wDxTH7SETs/Xbwpfw8LwPeCfyueD7I8R/JMes0aYG/lE7IzGOAU4ELIuI17Ruz9Ws1x1JZh6WoZZH7+ATwR8A64GHgH0ZU
Vk8RsT/wZeBtmflk+7ZxH7cutY39uGXmbzNzHa1e63rgpUtdQ5nO2iLi5bR6uy8FXkVrmObvRlzDvM8zIs4AdmbmtlHutyqTFvhL9oXpmflQ8edO4Ku0/vH/LCJWAxR/7uxR10LrD+2yfjGWopayfSwoM39W/HD+DvgnWsdukNoeAw6MiBUd6+e9V7H9BUX7Z0TESlqB+rnM/EqPv9OSHrdutS2X41bU8gStieXjBnivKut9jrbaTsnMh7PlaeCfGfyYDfp5Hg+cGRH3A1+gNazzjwv8fcZyzJ7Ra8xnOT1ojXn9iNakxt4JjCNHsJ8/AA5oW/4fWjPlf8/8yZsPFcunM3+C6LZi/cG0ziA4qHj8GDi42NY5QXRaj5pmmD9OPvJayvbRR22r25b/GvhCsXwk8yemfkRrUqr0cwW+yPyJqbcUyxcwf/Lruo6aArgGuKxj/diP2wK1jfW4AVPAgcXy84FvA2cs9r2qrLeP2la3HdPLgA+O8efgRJ6dtB37MetaY9VhOeoHrdn379MaW3zPiPZxRHFg954C9p5i/SG0Jmd+APxH2z+UAD5e1HQnMNv2Xm+gddrUHHB+2/pZ4K7iNZez8ITj52n9F/83tMbq3rgUtZTto4/aPlPsewewmflB9p5iP/fRdmZS2edafBa3FTV/Edi3WL9f8Xyu2H5ER10n0Pqv9w7aTnNcDsdtgdrGetyAo2idWrij+HtdPMR7VVJvH7XdXByzu4DP8uyZPEv6c1C0O5FnA3/sx6zbw1srSFJDTNoYviRpQAa+JDWEgS9JDWHgS1JDGPiS1BAGviQ1hIEvSQ3x/4tppPoWqYdUAAAAAElFTkSuQmCC\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(union_item['count'].values[40000:])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Roughly 75,000 pairs co-occur at least once."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### News article information"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 44,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Number of occurrences of each news category\n",
+ "plt.plot(user_click_merge['category_id'].value_counts().values)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 45,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Rarely seen news categories: some categories appear only a handful of times\n",
+ "plt.plot(user_click_merge['category_id'].value_counts().values[150:])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "count 1.630633e+06\n",
+ "mean 2.043012e+02\n",
+ "std 6.382198e+01\n",
+ "min 0.000000e+00\n",
+ "25% 1.720000e+02\n",
+ "50% 1.970000e+02\n",
+ "75% 2.290000e+02\n",
+ "max 6.690000e+03\n",
+ "Name: words_count, dtype: float64"
+ ]
+ },
+ "execution_count": 46,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Descriptive statistics of article word counts\n",
+ "user_click_merge['words_count'].describe()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 47,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 47,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(user_click_merge['words_count'].values)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Users' preferences over news categories\n",
+ "\n",
+ "This feature can be used to measure how broad a user's interests are."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 48,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 48,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD4CAYAAAD8Zh1EAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAUlUlEQVR4nO3dfZBc1Xnn8e8zM3pBaCwkNBJCAiQbsKwEy8CYwoEihTG2wXGwY5dDditWHGrZsp3EjpNdw9q1dtXGu3YqNvFWsomJIaESyoGAMSQFwRhjezeJJY+MAAsEEuJFEnoZAXpBGAlJZ//oK2UkzfRtzfR097nz/VRNze3Tt/s+Z27rp9unT98bKSUkSfnrancBkqTmMNAlqSIMdEmqCANdkirCQJekiuhp5cZmz56dFi5c2MpNSlL2Vq5cuT2l1Fe2XksDfeHChQwMDLRyk5KUvYh4rpH1HHKRpIow0CWpIgx0SaoIA12SKsJAl6SKMNAlqSIMdEmqiCwC/a6HN/J3P25oGqYkTVhZBPo9q17g9oEN7S5DkjpaFoEuSSpnoEtSRWQT6F4pT5LqyyLQI6LdJUhSx8si0CVJ5Qx0SaqIbAI94SC6JNWTRaA7gi5J5bIIdElSOQNdkioim0B3Hrok1ZdFoDsNXZLKZRHokqRy2QS6Qy6SVF8mge6YiySVySTQJUllDHRJqohsAt0hdEmqL4tAd9qiJJXLItAlSeUMdEmqiGwCPTkRXZLqyiLQHUKXpHJZBLokqZyBLkkVYaBLUkVkEejOQ5ekclkEuiSpXEOBHhG/HxGrI+JnEfGtiJgaEYsiYnlErIuI2yJi8ngXK0kaWWmgR8R84PeA/pTSLwLdwNXAV4AbUkpnAi8D14xnoU5Dl6T6Gh1y6QFOiIgeYBqwGXgncEdx/y3AB5peXSGciS5JpUoDPaW0CfgT4HlqQb4TWAnsSCntL1bbCMwf7vERcW1EDETEwODgYHOqliQdo5Ehl5nAVcAi4FTgROC9jW4gpXRjSqk/pdTf19c36kIlSfU1MuTyLuCZlNJgSul14NvARcBJxRAMwAJg0zjVCEDyjOiSVFcjgf48cGFETIuIAC4DHgceAj5crLMMuHt8SnQeuiQ1opEx9OXUPvz8KfBY8Zgbgc8Cn4mIdcDJwE3jWKckqURP+SqQUvoC8IWjmtcDFzS9IknSqGTzTVHnoUtSfVkEumPoklQui0CXJJUz0CWpIrIJdIfQJam+LALdc7lIUrksAl2SVC6bQE/OW5SkuvIIdEdcJKlUHoEuSSploEtSRWQT6I6gS1J9WQS6Q+iSVC6LQJcklTPQJaki8gl0B9Elqa4sAj08f64klcoi0CVJ5Qx0SaqIbALdIXRJqi+LQHcEXZLKZRHokqRyBrokVUQ2ge750CWpviwC3WnoklQui0CXJJUz0CWpIrIJdEfQJam+LALdIXRJKpdFoEuSyhnoklQR2QS609Alqb4sAt3zoUtSuYYCPSJOiog7ImJNRDwREe+IiFkR8UBErC1+zxzvYiVJI2v0CP3rwD+nlBYDS4EngOuAB1NKZwEPFrclSW1SGugRMQO4BLgJIKW0L6W0A7gKuKVY7RbgA+NTYk1yJrok1dXIEfoiYBD464h4OCK+GREnAnNTSpuLdbYAc4d7cERcGxEDETEwODg4qiIdQZekco0Eeg9wHvAXKaVzgT0cNbySaqdCHPYQOqV0Y0qpP6XU39fXN9Z6JUkjaCTQNwIbU0rLi9t3UAv4rRExD6D4vW18Sqxx2qIk1Vca6CmlLcCGiHhz0XQZ8DhwD7CsaFsG3D0uFYJjLpLUgJ4G1/td4NaImAysBz5G7T+D2yPiGuA54CPjU6IkqRENBXpKaRXQP8xdlzW1GknSqGXxTVFwDF2SymQR6OEguiSVyiLQJUnlDHRJqggDXZIqIotA9+y5klQui0CXJJUz0CWpIrIJ9OREdEmqK4tAdwhdksplEeiSpHIG
uiRVRDaB7gi6JNWXRaA7D12SymUR6JKkcga6JFVENoHuNHRJqi+LQPd86JJULotAlySVM9AlqSKyCfTkTHRJqiuLQHceuiSVyyLQJUnlsgl0py1KUn1ZBLpDLpJULotAlySVM9AlqSKyCXSH0CWpvkwC3UF0SSqTSaBLkspkE+hOW5Sk+rII9NcPHGT7K3vbXYYkdbQsAv3nrx+gd2pPu8uQpI7WcKBHRHdEPBwR/1TcXhQRyyNiXUTcFhGTx6vIOb1TnOYiSSWO5wj9U8ATQ25/BbghpXQm8DJwTTMLG6o7ggMOoktSXQ0FekQsAN4HfLO4HcA7gTuKVW4BPjAO9QHQ3RUcOGigS1I9jR6h/ynwX4GDxe2TgR0ppf3F7Y3A/OEeGBHXRsRARAwMDg6Orsiu4KBH6JJUV2mgR8SvANtSSitHs4GU0o0ppf6UUn9fX99onqI25OIRuiTV1cjUkYuAX42IK4GpwBuArwMnRURPcZS+ANg0XkV2BZjnklRf6RF6Sun6lNKClNJC4Grg+yml/wg8BHy4WG0ZcPe4FdlV++r/QVNdkkY0lnnonwU+ExHrqI2p39Scko7VXZwQ3ZkukjSy4/q2TkrpB8APiuX1wAXNL+lYh47QDxxMTOpuxRYlKT9ZfFN0589fB2Dv/oMla0rSxJVFoJ86YyqAM10kqY4sAr2nu1bm/oMeoUvSSPII9GIMff8Bj9AlaSR5BPqhI3QDXZJGlEegHzpCd8hFkkaURaDvO1AL8hf37GtzJZLUubII9FNnnAD4TVFJqieLQJ86qVbmoSN1SdKxsgj0ScWHovv8YpEkjSiLQO/prn0o+tyLr7a5EknqXFkE+uzpUwCYMimLciWpLbJIyCk9DrlIUpksAn2ygS5JpfII9OJD0Uc27mhvIZLUwbII9ENf/T8020WSdKxsEnLJvDfw1NZX2l2GJHWsbAJ9z779nDjZyxVJ0kiyCfSz5/by6Kad7S5DkjpWNoH+2usHiHYXIUkdLJtAP/f0mezdf9ATdEnSCLIJ9JRqQf7Czp+3uRJJ6kzZBPpb5r0BgMHde9tciSR1pmwCfcYJkwBYs2V3myuRpM6UTaAvPqUXgE0vO+QiScPJJtB7p9aO0Fc8+1KbK5GkzpRNoE/u6WLxKb28+Ipj6JI0nGwCHeCkaZN4enAPe/bub3cpktRxsgr0S87uA+ClPfvaXIkkdZ6sAv3MvukA3PaTDW2uRJI6T1aB/stvrh2hv+KQiyQdI6tAn9LTTV/vFP7mX5/l1X2GuiQNlVWgA1z0ppMBvzEqSUcrDfSIOC0iHoqIxyNidUR8qmifFREPRMTa4vfM8S8XrjhnHgB/8t2nWrE5ScpGI0fo+4E/SCktAS4EPhkRS4DrgAdTSmcBDxa3x92Fb6wdoT+7fU8rNidJ2SgN9JTS5pTST4vl3cATwHzgKuCWYrVbgA+MU41HmHHCJN6/9FQe27STex/b3IpNSlIWjmsMPSIWAucCy4G5KaVDiboFmDvCY66NiIGIGBgcHBxLrYf92rnzAfju6i1NeT5JqoKGAz0ipgN3Ap9OKe0ael+qnax82CtPpJRuTCn1p5T6+/r6xlTsIZcunsPiU3r5zqoXWPGM53aRJGgw0CNiErUwvzWl9O2ieWtEzCvunwdsG58Sh/f+pacCcNfDm1q5WUnqWI3McgngJuCJlNLXhtx1D7CsWF4G3N388kb2yUvPZOHJ01i+/kV+8GRL/y+RpI7UyBH6RcBvAu+MiFXFz5XAl4HLI2It8K7idktddOZsNrz8Kv/noadbvWlJ6jg9ZSuklP4fECPcfVlzyzk+X/rgOWzbvZdHNuzgtp88z0f6T6P2hkKSJp7svil6tCXz3sC23Xv57J2P8cLO19pdjiS1TfaB/vuXn82f/YdzAbhz5UbWbNlV8ghJqqbsAx3gjFknAvC1B57iD25/pM3VSFJ7VCLQz1kwg4HPv4srfvEUtu56jR89NcimHV5MWtLEUolA
B5g9fQoLZ5/I9lf28dGbV/Cfbhlod0mS1FKVCXSAT112Fnd+/Je4fMlctu56jSe37Gb94CvUvsgqSdVWqUCfOqmb88+Yydlzp/Pinn28509/xDu/+kP+8VFP4iWp+krnoefo2kvexDnzZ/Da6wf59G2reGZwDzte3UcQzJg2qd3lSdK4iFYOR/T396eBgdaNbaeUOPvz9/H6gX/v4/VXLOY///KbWlaDJI1VRKxMKfWXrVfJI/RDIoIbP9p/+GIYNzzwFOsHvTCGpGqqdKADXPrmOfDm2vKty5/nH1Zu4DuramdojIAvvv8XuPqC09tYoSQ1R+UDfajPXfkWfvzMi4dv/92/PccjG3dy9QVtLEqSmmRCBfqli+dw6eI5h28/sHord6/axL+s2364racr+J+/ds7ha5dKUi4mVKAf7ROXnnlEmKeU+M6qFxh49iUDXVJ2Kj3LZTTO/tx9zJ4+mQUzpx3R3tUF/+U9izn/jJltqkzSRNXoLJdKfbGoGX7rooWccfKJdHfFET8/Xv8SP/TKSJI62IQechnOf7vyLcO2n/OF+/mnxzazfvvw0x7n9E7l8+97C11dXmBDUnsY6A1631vnseLZl3h887HnW9/92n4Gd+/lYxct5LRZ04Z5tCSNPwO9QV/+0FtHvO/exzbziVt/yg3fe4qZ0ybXfZ63LpjBVW+b3+zyJMlAb4az5/Yye/oUvrt6a9319u4/QO/USQa6pHFhoDfBmXOmM/D5d5Wu9+X71vDN/7uev/rR+oafu6sr+NWlp9LXO2UsJUqaAAz0FjprznT2H0x86d4njutxr+7dz+9edtY4VSWpKgz0FvrQ+Qu44pxTOHgcU//f/kff4+ENO7i7OP/MaJw+axrnnu78eanqDPQWmzb5+P7k82eewPfXbOP7a0Y/B35KTxdr/sd7iXBKpVRlBnqHu+sTv8S23XtH/fjbBzbwjR+u50drtzOlp3nfI+vuCpYuOInJTXxOSWNjoHe43qmT6J06+qssLT6lF4BlN69oVkmH/fdfWcJvX7yo6c8raXQM9Ip7/1tP5bSZ09h34GBTn3fZzStYu+2VwxcPaYepk7o5ZcbUtm1f6jQGesX1dHfRv3BW05931omT+daK5/nWiueb/tzH486Pv4Pzz2h+/6QcGegalZuWvZ2123a3bfvbdu3lf923hme2v8qSeTPaVsfRpk7q8sNntY2nz1WWtu1+jQu+9GC7yzjGh85bwFc/srTdZahivEi0Km1O71Ru+PWlbN01+hlAzXbHyo08tbV971okA13Z+uC5C9pdwhEef2EX//joC5zzxfvbXcqE9JnLz+ZjF03sWVcGutQk11y8iJOn1z/bpsbHXQ9vYuC5lw30sTw4It4LfB3oBr6ZUvpyU6qSMrT0tJNYetpJ7S5jQlr53Mv8YM02Lv/aD9tdyohuWvZ2Tj95fK+XMOpAj4hu4M+By4GNwE8i4p6U0uPNKk6SGnHNxYu4f/WWdpdRVyu+VT2WI/QLgHUppfUAEfH3wFWAgS6ppa5623yvM8DYLhI9H9gw5PbGou0IEXFtRAxExMDg4OAYNidJqmfc3wOklG5MKfWnlPr7+vrGe3OSNGGNJdA3AacNub2gaJMktcFYAv0nwFkRsSgiJgNXA/c0pyxJ0vEa9YeiKaX9EfE7wP3Upi3enFJa3bTKJEnHZUzz0FNK9wL3NqkWSdIYeLkZSaoIA12SKqKlp8+NiEHguVE+fDawvYnl5MA+Twz2ufrG2t8zUkql875bGuhjEREDjZwPuErs88Rgn6uvVf11yEWSKsJAl6SKyCnQb2x3AW1gnycG+1x9LelvNmPokqT6cjpClyTVYaBLUkVkEegR8d6IeDIi1kXEde2u53hFxLMR8VhErIqIgaJtVkQ8EBFri98zi/aIiP9d9PXRiDhvyPMsK9ZfGxHLhrSfXzz/uuKx0YY+3hwR2yLiZ0Paxr2PI22jjX3+YkRsKvb1qoi4csh91xf1PxkR
7xnSPuzruzjx3fKi/bbiJHhExJTi9rri/oUt6u9pEfFQRDweEasj4lNFe2X3c50+d+Z+Til19A+1E389DbwRmAw8Aixpd13H2YdngdlHtf0xcF2xfB3wlWL5SuA+IIALgeVF+yxgffF7ZrE8s7hvRbFuFI+9og19vAQ4D/hZK/s40jba2OcvAn84zLpLitfuFGBR8Zrurvf6Bm4Hri6W/xL4eLH8CeAvi+Wrgdta1N95wHnFci/wVNGvyu7nOn3uyP3c0n/0o/yDvgO4f8jt64Hr213XcfbhWY4N9CeBeUNeNE8Wy98AfuPo9YDfAL4xpP0bRds8YM2Q9iPWa3E/F3JkuI17H0faRhv7PNI/9CNet9TOUvqOkV7fRaBtB3qK9sPrHXpssdxTrBdt2N93U7umcOX38zB97sj9nMOQS0OXuutwCfhuRKyMiGuLtrkppc3F8hZgbrE8Un/rtW8cpr0TtKKPI22jnX6nGGK4ecjQwPH2+WRgR0pp/1HtRzxXcf/OYv2WKd7+nwssZ4Ls56P6DB24n3MI9Cq4OKV0HnAF8MmIuGTonan2X3Cl54+2oo8d8nf8C+BNwNuAzcBX21rNOIiI6cCdwKdTSruG3lfV/TxMnztyP+cQ6Nlf6i6ltKn4vQ24C7gA2BoR8wCK39uK1Ufqb732BcO0d4JW9HGkbbRFSmlrSulASukg8FfU9jUcf59fBE6KiJ6j2o94ruL+GcX64y4iJlELtltTSt8umiu9n4frc6fu5xwCPetL3UXEiRHRe2gZeDfwM2p9OPTp/jJqY3MU7R8tZghcCOws3mreD7w7ImYWb+/eTW2sbTOwKyIuLGYEfHTIc7VbK/o40jba4lDoFD5IbV9Drc6ri5kLi4CzqH0AOOzruzgKfQj4cPH4o/9+h/r8YeD7xfrjqvjb3wQ8kVL62pC7KrufR+pzx+7ndnywMIoPIq6k9uny08Dn2l3Pcdb+RmqfaD8CrD5UP7WxsAeBtcD3gFlFewB/XvT1MaB/yHP9NrCu+PnYkPb+4gX1NPBntOcDsm9Re+v5OrVxwGta0ceRttHGPv9t0adHi3+Q84as/7mi/icZMhNppNd38dpZUfwt/gGYUrRPLW6vK+5/Y4v6ezG1oY5HgVXFz5VV3s91+tyR+9mv/ktSReQw5CJJaoCBLkkVYaBLUkUY6JJUEQa6JFWEgS5JFWGgS1JF/H85cMkmMcaqfgAAAABJRU5ErkJggg==\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
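+ "The plot above shows that a small fraction of users read an extremely wide range of categories, while most users read fewer than 20 categories."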
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 49,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " category_id \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " 250000.000000 \n",
+ " 250000.000000 \n",
+ " \n",
+ " \n",
+ " mean \n",
+ " 124999.500000 \n",
+ " 4.573188 \n",
+ " \n",
+ " \n",
+ " std \n",
+ " 72168.927986 \n",
+ " 4.419800 \n",
+ " \n",
+ " \n",
+ " min \n",
+ " 0.000000 \n",
+ " 1.000000 \n",
+ " \n",
+ " \n",
+ " 25% \n",
+ " 62499.750000 \n",
+ " 2.000000 \n",
+ " \n",
+ " \n",
+ " 50% \n",
+ " 124999.500000 \n",
+ " 3.000000 \n",
+ " \n",
+ " \n",
+ " 75% \n",
+ " 187499.250000 \n",
+ " 6.000000 \n",
+ " \n",
+ " \n",
+ " max \n",
+ " 249999.000000 \n",
+ " 95.000000 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id category_id\n",
+ "count 250000.000000 250000.000000\n",
+ "mean 124999.500000 4.573188\n",
+ "std 72168.927986 4.419800\n",
+ "min 0.000000 1.000000\n",
+ "25% 62499.750000 2.000000\n",
+ "50% 124999.500000 3.000000\n",
+ "75% 187499.250000 6.000000\n",
+ "max 249999.000000 95.000000"
+ ]
+ },
+ "execution_count": 49,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
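+ "### Distribution of the length of articles users read\n",
+ "\n",
+ "The average word count of the articles each user clicks reflects whether that user is more interested in long or short articles."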
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 50,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 50,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
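+ "The plot above shows that a small fraction of users read articles with a very high average word count, and another small fraction read articles with a very low average word count.\n",
+ "\n",
+ "Most users prefer news articles of 200 to 400 words."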
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 51,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 51,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "# Zoom in on the range that covers most users\n",
+ "plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Most users read articles of fewer than 250 words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 52,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " count \n",
+ " 250000.000000 \n",
+ " 250000.000000 \n",
+ " \n",
+ " \n",
+ " mean \n",
+ " 124999.500000 \n",
+ " 205.830189 \n",
+ " \n",
+ " \n",
+ " std \n",
+ " 72168.927986 \n",
+ " 47.174030 \n",
+ " \n",
+ " \n",
+ " min \n",
+ " 0.000000 \n",
+ " 8.000000 \n",
+ " \n",
+ " \n",
+ " 25% \n",
+ " 62499.750000 \n",
+ " 187.500000 \n",
+ " \n",
+ " \n",
+ " 50% \n",
+ " 124999.500000 \n",
+ " 202.000000 \n",
+ " \n",
+ " \n",
+ " 75% \n",
+ " 187499.250000 \n",
+ " 217.750000 \n",
+ " \n",
+ " \n",
+ " max \n",
+ " 249999.000000 \n",
+ " 3434.500000 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id words_count\n",
+ "count 250000.000000 250000.000000\n",
+ "mean 124999.500000 205.830189\n",
+ "std 72168.927986 47.174030\n",
+ "min 0.000000 8.000000\n",
+ "25% 62499.750000 187.500000\n",
+ "50% 124999.500000 202.000000\n",
+ "75% 187499.250000 217.750000\n",
+ "max 249999.000000 3434.500000"
+ ]
+ },
+ "execution_count": 52,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# More detailed summary statistics\n",
+ "user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Analysis of when users click on news"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 53,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Normalize the timestamps to [0, 1] for easier visualization\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "mm = MinMaxScaler()\n",
+ "user_click_merge['click_timestamp'] = mm.fit_transform(user_click_merge[['click_timestamp']])\n",
+ "user_click_merge['created_at_ts'] = mm.fit_transform(user_click_merge[['created_at_ts']])\n",
+ "\n",
+ "user_click_merge = user_click_merge.sort_values('click_timestamp')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 54,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " rank \n",
+ " click_cnts \n",
+ " category_id \n",
+ " created_at_ts \n",
+ " words_count \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 18 \n",
+ " 249990 \n",
+ " 162300 \n",
+ " 0.000000 \n",
+ " 4 \n",
+ " 3 \n",
+ " 20 \n",
+ " 1 \n",
+ " 25 \n",
+ " 2 \n",
+ " 5 \n",
+ " 5 \n",
+ " 281 \n",
+ " 0.989186 \n",
+ " 193 \n",
+ " \n",
+ " \n",
+ " 2 \n",
+ " 249998 \n",
+ " 160974 \n",
+ " 0.000002 \n",
+ " 4 \n",
+ " 1 \n",
+ " 12 \n",
+ " 1 \n",
+ " 13 \n",
+ " 2 \n",
+ " 5 \n",
+ " 5 \n",
+ " 281 \n",
+ " 0.989092 \n",
+ " 259 \n",
+ " \n",
+ " \n",
+ " 30 \n",
+ " 249985 \n",
+ " 160974 \n",
+ " 0.000003 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 8 \n",
+ " 2 \n",
+ " 8 \n",
+ " 8 \n",
+ " 281 \n",
+ " 0.989092 \n",
+ " 259 \n",
+ " \n",
+ " \n",
+ " 50 \n",
+ " 249979 \n",
+ " 162300 \n",
+ " 0.000004 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 25 \n",
+ " 2 \n",
+ " 2 \n",
+ " 2 \n",
+ " 281 \n",
+ " 0.989186 \n",
+ " 193 \n",
+ " \n",
+ " \n",
+ " 25 \n",
+ " 249988 \n",
+ " 160974 \n",
+ " 0.000004 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 21 \n",
+ " 2 \n",
+ " 17 \n",
+ " 17 \n",
+ " 281 \n",
+ " 0.989092 \n",
+ " 259 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id click_article_id click_timestamp click_environment \\\n",
+ "18 249990 162300 0.000000 4 \n",
+ "2 249998 160974 0.000002 4 \n",
+ "30 249985 160974 0.000003 4 \n",
+ "50 249979 162300 0.000004 4 \n",
+ "25 249988 160974 0.000004 4 \n",
+ "\n",
+ " click_deviceGroup click_os click_country click_region \\\n",
+ "18 3 20 1 25 \n",
+ "2 1 12 1 13 \n",
+ "30 1 17 1 8 \n",
+ "50 1 17 1 25 \n",
+ "25 1 17 1 21 \n",
+ "\n",
+ " click_referrer_type rank click_cnts category_id created_at_ts \\\n",
+ "18 2 5 5 281 0.989186 \n",
+ "2 2 5 5 281 0.989092 \n",
+ "30 2 8 8 281 0.989092 \n",
+ "50 2 2 2 281 0.989186 \n",
+ "25 2 17 17 281 0.989092 \n",
+ "\n",
+ " words_count \n",
+ "18 193 \n",
+ "2 259 \n",
+ "30 259 \n",
+ "50 193 \n",
+ "25 259 "
+ ]
+ },
+ "execution_count": 54,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "user_click_merge.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 55,
+ "metadata": {},
+ "outputs": [],
+ "source": [
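+ "def mean_diff_time_func(df, col):\n",
+ "    # Mean absolute gap between consecutive values of `col` within one user's rows\n",
+ "    df = pd.DataFrame(df, columns=[col])\n",
+ "    # The first row has no predecessor, so it is compared against 0 (original behavior)\n",
+ "    df['time_shift1'] = df[col].shift(1).fillna(0)\n",
+ "    df['diff_time'] = abs(df[col] - df['time_shift1'])\n",
+ "    return df['diff_time'].mean()"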
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Mean time gap between a user's consecutive clicks\n",
+ "mean_diff_click_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'click_timestamp'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 57,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(mean_diff_click_time.values, reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The plot above shows that the time gap between clicks differs from user to user."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Mean gap in creation time between consecutively clicked articles\n",
+ "mean_diff_created_time = user_click_merge.groupby('user_id')[['click_timestamp', 'created_at_ts']].apply(lambda x: mean_diff_time_func(x, 'created_at_ts'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[]"
+ ]
+ },
+ "execution_count": 59,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMywgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/Il7ecAAAACXBIWXMAAAsTAAALEwEAmpwYAAAbj0lEQVR4nO3deXAc53nn8e8zMxhcJAji4CESFE8djKylZFhHpNURK7o2K27KiU1mXdYmWmvltTbeko+S40RWKZXK2pt1alNmbNO2NmtvYlmHD65MLdfSSrYsx5IgiSJFUpQg6iAoHiDBmyCAmXn2j2mAAxAEhuTMNND9+1ShpuftF9NPY8Afmm/3vG3ujoiIREsi7AJERKT0FO4iIhGkcBcRiSCFu4hIBCncRUQiKBXWhltaWnz+/PlhbV5EZFJ66aWX9rp763j9Qgv3+fPn09HREdbmRUQmJTN7t5h+GpYREYkghbuISAQp3EVEImjccDezB81sj5m9dor1ZmZ/Z2adZrbBzC4tfZkiInI6ijly/wfg5jHW3wIsCb7uBL5x9mWJiMjZGDfc3f2XQM8YXZYD3/O83wCNZja7VAWKiMjpK8WY+xxge8HzrqDtJGZ2p5l1mFlHd3d3CTYtIiKjqeh17u6+GlgN0N7efkZzDXfuOcw/Pv8e85rqSJiRMDAzkokTy4PtyYQxq6GGyxc2l3Q/REQmulKE+w6greD53KCtLJ7asof/8dw7p/U9T95zDYtnTC1PQSIiE1Apwn0NcLeZPQRcDhx0950leN1R/YdrF/HxK85lIJsj55BzJ5fzE8vuuEM25/zijW6+vGYT+48NlKscEZEJadxwN7MfANcBLWbWBXwZqAJw928Ca4FbgU7gGPDH5Sp2UH11cX+Tdh86DkB/JlfOckREJpxxU9LdV46z3oFPl6yiEkqn8ueLFe4iEjeR/oTqYLj3KdxFJGYiHe7Vg0fuWYW7iMRLxMM9CUDfQDbkSkREKivS4V6VzO/eQPaMLqkXEZm0Ih3uCcs/5lzhLiLxEulwN8unu6JdROIm0uE+eOTuOnIXkZiJeLjn0z2XU7iLSLzEI9yV7SISM5EOd3RCVURiKtLhPjjmLiISNxEP98FhGR25i0i8xCTcQy5ERKTCIh3upjF3EYmpWIS7sl1E4ibS4a7r3EUkrmIR7op2EYmbiId7/lFj7iISN5EOd9PVMiISU5EOd8gfvWviMBGJm8iHu5lpWEZEYify4Z4/cg+7ChGRyop8uOeP3MOuQkSksiIf7hpzF5E4ikG4a8xdROIn8uFu6FJIEYmfyId7wkwnVEUkdiIf7mb6hKqIxE/kwz2RMJ1QFZHYiX6461JIEYmhyId7/oSq0l1E4qWocDezm81sq5l1mtm9o6yfZ2ZPm9krZrbBzG4tfalnxsw05a+IxM644W5mSWAVcAuwFFhpZktHdPtz4GF3vwRYAfx9qQs9U/oQk4jEUTFH7pcBne6+zd37gYeA5SP6ONAQLE8D3i9diWcnYUYuF3YVIiKVVUy4zwG2FzzvCtoK3Q983My6gLXAfxrthczsTjPrMLOO7u7uMyj39CV0KaSIxFCpTqiuBP7B3ecCtwLfN7OTXtvdV7t7u7u3t7a2lmjTY9PEYSISR8WE+w6greD53KCt0B3AwwDu/s9ADdBSigLPlmnMXURiqJhwfxFYYmYLzCxN/oTpmhF93gM+DGBmF5IP98qMu4wjoatlRCSGxg13d88AdwPrgC3kr4rZZGYPmNltQbfPAp80s1eBHwD/zifI4bLG3EUkjlLFdHL3teRPlBa23VewvBm4qrSllYY+oSoicRT5T6iiI3cRiaHIh3t+yl+Fu4jESwzCXTfIFpH4iUG46zZ7IhI/kQ93fYhJROIo8uGuicNEJI4iH+752+yFXYWISGVFPtx1tYyIxFHkw11j7iISR5EPd00/ICJxFINwN13nLiKxE/lw1w2yRSSOIh/uOnIXkTiK
fLibxtxFJIYiH+46cheROIp+uCd05C4i8RP5cDc0cZiIxE/0w93QPVRFJHYiH+66zZ6IxFEMwl2zQopI/MQg3DXmLiLxE/lwN4NcLuwqREQqKwbhriN3EYmfyId7wsKuQESk8mIQ7jpyF5H4iUm4h12FiEhlRT7cNXGYiMRRDMJdE4eJSPxEPtxTCSOjayFFJGYiH+7pZIL+jMJdROIl+uGeUriLSPwUFe5mdrOZbTWzTjO79xR9Pmpmm81sk5n9U2nLPHMKdxGJo9R4HcwsCawCfhfoAl40szXuvrmgzxLgi8BV7r7fzGaUq+DTlU4l6M8q3EUkXoo5cr8M6HT3be7eDzwELB/R55PAKnffD+Due0pb5plLJxMMZJ2cLnYXkRgpJtznANsLnncFbYXOA84zs+fM7DdmdvNoL2Rmd5pZh5l1dHd3n1nFpymdyu+ijt5FJE5KdUI1BSwBrgNWAt82s8aRndx9tbu3u3t7a2triTY9tmqFu4jEUDHhvgNoK3g+N2gr1AWscfcBd38beIN82IduKNx1UlVEYqSYcH8RWGJmC8wsDawA1ozo8xPyR+2YWQv5YZptpSvzzKUV7iISQ+OGu7tngLuBdcAW4GF332RmD5jZbUG3dcA+M9sMPA183t33lavo06FwF5E4GvdSSAB3XwusHdF2X8GyA/cEXxNKTSoJQO9ANuRKREQqJ/KfUK2vzv/9OtqXCbkSEZHKiUG454/cj/bryF1E4iPy4V5TFQzLKNxFJEYiH+61Qbgf15i7iMRI9MM9rROqIhI/kQ/3KcEJ1UO9AyFXIiJSObEI93QyQc+x/rBLERGpmMiHu5nRVJ+m54jCXUTiI/LhDjBrWg3vH+wNuwwRkYqJRbi3NdWxvUfhLiLxEY9wn17L+wd6yWjaXxGJiViE+8LWKWRyzrs9x8IuRUSkImIR7hfOngrA+vcOhFuIiEiFxCPcZzUwY2o1azfuDLsUEZGKKGrK38kukTA+2t7G15/u5Pq/eYa6dBIzMIyEAWYEDyQKlg3LPxYsJyz/CPnLLBNG0N9OPA615b+PweXB58F6hn3f8NcZ7DBUwyg1DfUYbX3B6zPKuvxrnNjW4LrBTZ/0WoOvU7AuYfmfbSL4OQz+bIaeJ2zoZzTYNr+5nssXNpfuzRWRUcUi3AE+c8MSptSk2Nh1kOMDWRxwd3LO0DKAO+TccQcn/5jLgZML2grX578hV9B38PsHXwsK1gXb8cGiRrQVbnPY959q/VCf0V8/X0/wfIxtDa4/Va0+VHBpPHrXlbTPbyrti4rIMLEJ96pkgruuXRR2GZOeF/xhy7kP/aHLBX/kcu547sS6nJ/4I3qsP8PyVc/xgxe2K9xFyiw24S6lMTjsBJAcGvwp3rXntfKrzu4SVyUiI8XihKpMHMvaGtl9qI99R/rCLkUk0hTuUlGLZ0wB4M09R0KuRCTaFO5SUUtnNwCw+f1DIVciEm0Kd6mo1qnVNNSk2LZXR+4i5aRwl4oyM85trufdfZoKQqScFO5ScbOm1dB9WCdURcpJ4S4VN7Ohml2HjoddhkikKdyl4prqqznYO0AuV+KPvorIEIW7VFxDTQp3ONyXCbsUkchSuEvFTautAuDgsYGQKxGJLoW7VFxTfRqA/cd003KRclG4S8VND8K9R+EuUjYKd6m4prog3I8o3EXKpahwN7ObzWyrmXWa2b1j9PuImbmZtZeuRIma6UG4H+jVmLtIuYwb7maWBFYBtwBLgZVmtnSUflOBzwDPl7pIiZapNSnM4ICGZUTKppgj98uATnff5u79wEPA8lH6/SXwFUCfTpExJRJGY20VB3S1jEjZFBPuc4DtBc+7grYhZnYp0ObuPxvrhczsTjPrMLOO7m7dsCHOpten6TmqI3eRcjnrE6pmlgC+Bnx2vL7uvtrd2929vbW19Ww3LZNYU53CXaScign3HUBbwfO5QdugqcBFwDNm9g5wBbBGJ1VlLE31aV3nLlJGxYT7i8AS
M1tgZmlgBbBmcKW7H3T3Fnef7+7zgd8At7l7R1kqlkhoqk+zT0fuImUzbri7ewa4G1gHbAEedvdNZvaAmd1W7gIlmqbXp9l/tB93TR4mUg6pYjq5+1pg7Yi2+07R97qzL0uirrk+TSbnHO7L0FBTFXY5IpGjT6hKKAY/yLRfQzMiZaFwl1AMTh6mcXeR8lC4SyhmTasBYMf+3pArEYkmhbuEYl5THQDv9ehG2SLloHCXUNRXp2iZkqZrv8JdpBwU7hKaudPrdOQuUiYKdwnNgpZ63th9RDfKFimDoq5zFymH37lgBj9+ZQfLVz3H7Gk1JBNGImEkzEgaBcv59mQCEha0JYxE0CcZtJ1YDtqD52bklxPGsrZGLp7bGPaui5Sdwl1C83sXz+b1XYd4rnMf7/UcI+dONue4QzZYzuWcXPA8vxy0O8Hj8LbxLGip5+nPXVf2fRMJm8JdQmNmfP6mC/j8TaV5PffhfxhG/pH4q7VbWPPq+7g7ZlaajYpMUAp3iQwLhmASGFXJk9efP3Mq/Zkch45nmFarKQ8k2nRCVWJjRkM1AHuP9IVciUj5KdwlNoamPDiiKQ8k+hTuEhvN9fkj956jOnKX6FO4S2xosjKJE4W7xIaGZSROFO4SG+lUgub6NDsPaiZKiT6Fu8RKW1Md23sU7hJ9CneJlbYmTVYm8aBwl1iZ11TL+wd6yWRzYZciUlYKd4mVtul1ZHLOzoPHwy5FpKwU7hIrg3eA2q6hGYk4hbvEysLWKQBs3nko5EpEyksTh0mszJpWw0VzGvj2s9tYMnMq1akEVckE6WSCdCpBVdKoSiZIJW1o7vhEMB+8Fc4jP2Jeec0yKRONwl1i56/+zQf4yDd+ze0PvlCy17Qg8AtvDpKw4TcUyf9xOPGHIZHInwP43p9cRiqp/0RLaSncJXb+RVsjv/jC9ew80Et/NsdA1unP5BjI5r/6MjlyOc/fIMQZdpOQwTnic15wI5Gc4z5af4ZuJjL43AuW39l3lF+/tY/dh/uY01gb9o9FIkbhLrE0p7E29EB9cvNu/v33OuhWuEsZ6P+CIiGZNa0GgF26LFPKQOEuEpK50/NH6zsOaDoEKT2Fu0hIptVWUZ9O0rVf19xL6SncRUJiZsydXkfXfh25S+kVFe5mdrOZbTWzTjO7d5T195jZZjPbYGZPmdm5pS9VJHrOba7jrT1Hwi5DImjccDezJLAKuAVYCqw0s6Ujur0CtLv7xcCjwFdLXahIFC2b18i2vUfZr7tDSYkVcynkZUCnu28DMLOHgOXA5sEO7v50Qf/fAB8vZZEiUfXBedMBeLhjO5cvbCaVOPEJ2XQyMbRclSz49GzwaVmRsRQT7nOA7QXPu4DLx+h/B/DEaCvM7E7gToB58+YVWaJIdC2b18jClnr++onXT+v7qpJGKnEi8Af/CKz4UBt3/86SMlUrk0lJP8RkZh8H2oFrR1vv7quB1QDt7e1eym2LTEbVqSSP/+nVvPLeAfoyWQayTibrZHI5+jM5MjkPPjmbf8xkc/RnnUw2N6Ld+VXnXn62cZfCXYDiwn0H0FbwfG7QNoyZ3QB8CbjW3ftKU55I9NWlU1y1uOWsX+cvfvIaP11/0j9NialirpZ5EVhiZgvMLA2sANYUdjCzS4BvAbe5+57Slyki45kzvZZDxzMcPj4QdikyAYwb7u6eAe4G1gFbgIfdfZOZPWBmtwXd/iswBXjEzNab2ZpTvJyIlMm5wY1I3t57NORKZCIoaszd3dcCa0e03VewfEOJ6xKR03TRnGkAvPB2DxfPbQy3GAmdZoUUiYi2pjoumtPA15/upHPPEVJJwzhxM5FUIri5SDC/fCJR0BbMNZ9MJEgaLJ4xlauXnP15AAmPwl0kQv77ikv4wqMbeHLLnqG543PO0PzzmYK56XNjXK9WlTQ23n8TNVXJyhUvJaVwF4mQRa1TeOxTv11UXw+CP5PLkcvlb0KSzTlPbdnNPQ+/yuu7DrOsrbG8BUvZ
aOIwkZgavCdsdSpJbTrJlOoU02qruHxhMwAbuw6EW6CcFYW7iAxzzrQamuvTbOg6GHYpchY0LCMiw5gZly9s4pGXunjhnZ6h+WxSBVMeJBNW0J5/XNQ6hc/eeJ7mvZkgFO4icpIHll/E/OZ6dhzoJRNMcZDNOQO5/NQHmaxzJJPJt2Wdg8f6eeK1XfzBB+cyv6U+7PIFhbuIjKJlSjVfuPmCovu/vfco1//NMzzbuVfhPkEo3EXkrM1vrmNOYy2PvtTFrIYaqlMJ0qnEsMfqVHJoeUp1ilRSp/zKSeEuImfNzLht2Tl845m3+OT3OsbtP7+5jqc/d53G58tI4S4iJfGFm85n5YfmcbB3gP5slr6BHH3ZHH0DOfqzOfoGsvRnc7z87gEee7mLt7qPsnjGlLDLjiyFu4iUhJkxr7lu3H7/cnErj73cxZ/9aCOXnjud2qokNVUJatNJaqqS1AZf7fOn01iXrkDl0aRwF5GKmtdcx723XMDqX25j/fYD9Gdzo/b7/Uvm8LcfW1bZ4iJE4S4iFXfXtYu469pFAGSyOY5nchwfyNLbn+X4QJb/8sTrPLllN4+91EV9dZLadIq6dP6Ivi6d5JzGWs17Mw6Fu4iEKpVMMCWZv4Jm0F3XLeLZ7+zls4+8Our3LJ3dwM/+9GqdkB2Dwl1EJpwPzW+i489vYN+Rfo71Z+jtz3KsP0vvQJbnt/Xw4HNv85mH1nNOYy3ppFFdlR+vv3pxC+fPmhp2+ROCwl1EJqSGmioaaqpOar96cQsbug7w8827yeZ82Jj9eTOn8JNPX0VVMkFVzK+jN/cxJnUuo/b2du/oGP96WBGRsbjnA/67v3qbr/6frUPtqYRRl05yw9KZfO2jy8IrsMTM7CV3bx+vn47cRWRSM8tPW/ypaxdx4ewGtuw8RDbr9A5kebXrAD96eQddPb1MrUlRm05y4ewGPn394rDLLjuFu4hEgplx/fkzuP78GUNtB3sH+MvHN/PuvqPsOnScLTsP8fiGnXS808OsaTVMqU7x4QtnckUwh32UaFhGRGLj+ECWv/jJa7z07n4O92XoPtxHOpngmvNaaahNccXCZj7a3hZ2mWMqdlhG4S4isfXm7sM88Phmug/38fquwwAsaq2neUo1H2tv44alM5lWe/JJ3TAp3EVETsP+o/38/TOd7DjQyy/f2MuRvgwAC1rq+diH2rh6cQsXzZkWcpUKdxGRM3a0L8Ozb3bzatdBHt/wPtt7egFork9zxaJmfu8Ds7n+ghmhfEpW4S4iUiK7Dh7nsZe7eOW9Azz1+m7coaEmxTXntbJ82RyuWtxMXboy16foUkgRkRKZNa1m6PLJg8cGeLazm3WbdrNu0y4e37CTdDLBrR+YxR+2t3HlwmYSifCnRdCRu4jIGertz/KLN/bwv1/dyc+37KY/k2NmQzUfuXQuf3T5POZOH38K5NOlYRkRkQo6dHyAJzbu5Kfr3+fXb+0jYXDTb83iU9ct4uK5jSXbjsJdRCQk23uO8U8vvMf/+ud3OdyX4YYLZ3DvLReW5M5TCncRkZAdOj7Ad365je/86m0yOef+f/1b/NHl887qNYsN93hPmyYiUkYNNVXcc+P5PPO567hyYTN/9uONPNKxvSLbLirczexmM9tqZp1mdu8o66vN7IfB+ufNbH7JKxURmaRmNNTw7U+089uLmvnymk0cH8iWfZvjhruZJYFVwC3AUmClmS0d0e0OYL+7Lwb+FvhKqQsVEZnM0qkEd1y9gGP9WdZvP1D27RVz5H4Z0Onu29y9H3gIWD6iz3LgfwbLjwIfNt3/SkRkmKXnNADwVveRsm+rmHCfAxQOEnUFbaP2cfcMcBA4aQ5NM7vTzDrMrKO7u/vMKhYRmaTq0in+1Qdm01aG699HqugnVN19NbAa8lfLVHLbIiJhm1Zbxap/e2lFtlXMkfsOoHCC47lB26h9zCwFTAP2laJAERE5fcWE+4vAEjNbYGZpYAWwZkSfNcDtwfIf
AP/Pw7qAXkRExh+WcfeMmd0NrAOSwIPuvsnMHgA63H0N8F3g+2bWCfSQ/wMgIiIhKWrM3d3XAmtHtN1XsHwc+MPSliYiImdKn1AVEYkghbuISAQp3EVEIkjhLiISQaFN+Wtm3cC7Z/jtLcDeEpYzGWif40H7HA9ns8/nunvreJ1CC/ezYWYdxcxnHCXa53jQPsdDJfZZwzIiIhGkcBcRiaDJGu6rwy4gBNrneNA+x0PZ93lSjrmLiMjYJuuRu4iIjEHhLiISQZMu3Me7WfdEZ2bvmNlGM1tvZh1BW5OZ/dzM3gwepwftZmZ/F+zrBjO7tOB1bg/6v2lmtxe0fzB4/c7geyt+u0Mze9DM9pjZawVtZd/HU20jxH2+38x2BO/1ejO7tWDdF4P6t5rZTQXto/5+B1NuPx+0/zCYfjvUm9ObWZuZPW1mm81sk5l9JmiP7Hs9xj5PvPfa3SfNF/kph98CFgJp4FVgadh1neY+vAO0jGj7KnBvsHwv8JVg+VbgCcCAK4Dng/YmYFvwOD1Ynh6seyHoa8H33hLCPl4DXAq8Vsl9PNU2Qtzn+4HPjdJ3afC7Ww0sCH6nk2P9fgMPAyuC5W8CnwqW/yPwzWB5BfDDCu7zbODSYHkq8Eawb5F9r8fY5wn3Xlf0H30JfrBXAusKnn8R+GLYdZ3mPrzDyeG+FZhd8MuzNVj+FrByZD9gJfCtgvZvBW2zgdcL2of1q/B+zmd40JV9H0+1jRD3+VT/4If93pK/V8KVp/r9DoJtL5AK2of6DX5vsJwK+llI7/lPgd+Nw3s9yj5PuPd6sg3LFHOz7onOgf9rZi+Z2Z1B20x33xks7wJmBsun2t+x2rtGaZ8IKrGPp9pGmO4OhiAeLBg6ON19bgYOeP7m84Xtw17Lx7g5fbkFQwSXAM8Tk/d6xD7DBHuvJ1u4R8HV7n4pcAvwaTO7pnCl5/8sR/r61Ers4wT5OX4DWAQsA3YC/y3UasrEzKYAjwH/2d0PFa6L6ns9yj5PuPd6soV7MTfrntDcfUfwuAf4MXAZsNvMZgMEj3uC7qfa37Ha547SPhFUYh9PtY1QuPtud8+6ew74Nvn3Gk5/n/cBjZa/+Xxh+7DXshBuTm9mVeRD7h/d/UdBc6Tf69H2eSK+15Mt3Iu5WfeEZWb1ZjZ1cBm4EXiN4TcYv538OB5B+yeCqwyuAA4G/xVdB9xoZtOD//7dSH5cbidwyMyuCK4q+ETBa4WtEvt4qm2EYjB8Ar9P/r2GfJ0rgqsfFgBLyJ84HPX3OzgyfZr8zefh5J9fKDenD37+3wW2uPvXClZF9r0+1T5PyPc6jJMQZ3kC41byZ6jfAr4Udj2nWftC8mfFXwU2DdZPftzsKeBN4EmgKWg3YFWwrxuB9oLX+hOgM/j644L29uAX6y3g64Rwcg34Afn/mg6QHzO8oxL7eKpthLjP3w/2aUPwD3N2Qf8vBfVvpeCKplP9fge/Oy8EP4tHgOqgvSZ43hmsX1jBfb6a/HDIBmB98HVrlN/rMfZ5wr3Xmn5ARCSCJtuwjIiIFEHhLiISQQp3EZEIUriLiESQwl1EJIIU7iIiEaRwFxGJoP8P5wcWjQlff2gAAAAASUVORK5CYII=\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(sorted(mean_diff_created_time.values, reverse=True))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The plot shows that the articles a user clicks in succession also differ in creation time."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Defaulting to user installation because normal site-packages is not writeable\n",
+ "Looking in indexes: https://mirrors.aliyun.com/pypi/simple\n",
+ "Collecting gensim\n",
+ " Downloading https://mirrors.aliyun.com/pypi/packages/2b/e0/fa6326251692056dc880a64eb22117e03269906ba55a6864864d24ec8b4e/gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)\n",
+ "\u001b[K |████████████████████████████████| 24.2 MB 91.0 MB/s eta 0:00:01\n",
+ "\u001b[?25hRequirement already satisfied: six>=1.5.0 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.15.0)\n",
+ "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
+ "Requirement already satisfied: scipy>=0.18.1 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.5.4)\n",
+ "Requirement already satisfied: numpy>=1.11.3 in /opt/conda/lib/python3.6/site-packages (from gensim) (1.19.1)\n",
+ "Collecting smart-open>=1.8.1\n",
+ " Downloading https://mirrors.aliyun.com/pypi/packages/e3/cf/6311dfb0aff3e295d63930dea72e3029800242cdfe0790478e33eccee2ab/smart_open-4.0.1.tar.gz (117 kB)\n",
+ "\u001b[K |████████████████████████████████| 117 kB 96.7 MB/s eta 0:00:01\n",
+ "\u001b[?25hBuilding wheels for collected packages: smart-open\n",
+ " Building wheel for smart-open (setup.py) ... \u001b[?25ldone\n",
+ "\u001b[?25h Created wheel for smart-open: filename=smart_open-4.0.1-py3-none-any.whl size=108249 sha256=50eb67320a58790e8b173971aeb6af7b636d48259d7c9de759612e58e334215b\n",
+ " Stored in directory: /home/admin/.cache/pip/wheels/c3/14/fc/a0e523e5d2f13d083ce0af09d4e2861d8e2ec65fc466fb1dff\n",
+ "Successfully built smart-open\n",
+ "Installing collected packages: smart-open, gensim\n",
+ "Successfully installed gensim-3.8.3 smart-open-4.0.1\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Install gensim\n",
+ "!pip install gensim"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 44,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from gensim.models import Word2Vec\n",
+ "import logging, pickle\n",
+ "\n",
+ "# Note: the model is trained for only a few epochs here (iter=10)\n",
+ "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
+ "    click_df = click_df.sort_values('click_timestamp')\n",
+ "    # Article ids must be strings before they can be used as Word2Vec tokens\n",
+ "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
+ "    # Build one sentence per user: the list of clicked article ids in time order\n",
+ "    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
+ "    docs = docs['click_article_id'].values.tolist()\n",
+ "\n",
+ "    # Log training progress\n",
+ "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
+ "\n",
+ "    # These hyperparameters strongly affect the learned vectors; negative sampling defaults to 5\n",
+ "    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=10)\n",
+ "    \n",
+ "    # Store the embeddings as a dict keyed by (string) article id\n",
+ "    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id']}\n",
+ "    \n",
+ "    return item_w2v_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 45,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "item_w2v_emb_dict = trian_item_word2vec(user_click_merge)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " user_id \n",
+ " click_article_id \n",
+ " click_timestamp \n",
+ " click_environment \n",
+ " click_deviceGroup \n",
+ " click_os \n",
+ " click_country \n",
+ " click_region \n",
+ " click_referrer_type \n",
+ " \n",
+ " \n",
+ " \n",
+ " \n",
+ " 25667 \n",
+ " 190841 \n",
+ " 199197 \n",
+ " 1507045276129 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 20 \n",
+ " 2 \n",
+ " \n",
+ " \n",
+ " 25668 \n",
+ " 190841 \n",
+ " 285298 \n",
+ " 1507045302920 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 20 \n",
+ " 2 \n",
+ " \n",
+ " \n",
+ " 25669 \n",
+ " 190841 \n",
+ " 156624 \n",
+ " 1507046638885 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 20 \n",
+ " 2 \n",
+ " \n",
+ " \n",
+ " 25670 \n",
+ " 190841 \n",
+ " 129029 \n",
+ " 1507046668885 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 20 \n",
+ " 2 \n",
+ " \n",
+ " \n",
+ " 107739 \n",
+ " 164226 \n",
+ " 214800 \n",
+ " 1507131402464 \n",
+ " 4 \n",
+ " 1 \n",
+ " 17 \n",
+ " 1 \n",
+ " 21 \n",
+ " 2 \n",
+ " \n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " user_id ... click_referrer_type\n",
+ "25667 190841 ... 2\n",
+ "25668 190841 ... 2\n",
+ "25669 190841 ... 2\n",
+ "25670 190841 ... 2\n",
+ "107739 164226 ... 2\n",
+ "\n",
+ "[5 rows x 9 columns]"
+ ]
+ },
+ "execution_count": 36,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Randomly sample 15 users and inspect the similarity between the articles each one clicks in succession\n",
+ "sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)\n",
+ "sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]\n",
+ "\n",
+ "sub_user_info.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# The previous version used the embeddings provided with the competition, but those do not cover every article, so the plotting code below raised a KeyError\n",
+ "# To avoid that error, the embeddings trained with word2vec above are used for the visualization instead\n",
+ "def get_item_sim_list(df):\n",
+ "    sim_list = []\n",
+ "    item_list = df['click_article_id'].values\n",
+ "    for i in range(0, len(item_list)-1):\n",
+ "        emb1 = item_w2v_emb_dict[str(item_list[i])] # word2vec was trained on str ids, so convert before lookup\n",
+ "        emb2 = item_w2v_emb_dict[str(item_list[i+1])]\n",
+ "        sim_list.append(np.dot(emb1,emb2)/(np.linalg.norm(emb1)*(np.linalg.norm(emb2))))\n",
+ "    sim_list.append(0)\n",
+ "    return sim_list"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 46,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "image/png": "\n",
+ "text/plain": [
+ ""
+ ]
+ },
+ "metadata": {
+ "needs_background": "light"
+ },
+ "output_type": "display_data"
+ }
],
- "text/plain": [
- " user_id ... click_referrer_type\n",
- "25667 190841 ... 2\n",
- "25668 190841 ... 2\n",
- "25669 190841 ... 2\n",
- "25670 190841 ... 2\n",
- "107739 164226 ... 2\n",
- "\n",
- "[5 rows x 9 columns]"
- ]
- },
- "execution_count": 36,
- "metadata": {},
- "output_type": "execute_result"
+ "source": [
+ "for _, user_df in sub_user_info.groupby('user_id'):\n",
+ "    item_sim_list = get_item_sim_list(user_df)\n",
+ "    plt.plot(item_sim_list)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
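+ "Because the word vectors were trained for only a few iterations, the visualization is not very accurate; training for more iterations should show the pattern more clearly."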
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
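+ "## Summary\n",
+ "\n",
+ "The data analysis so far yields several findings that will help with feature engineering and further analysis:\n",
+ "1. User ids in the training and test sets do not overlap, so the model has never seen the test users\n",
+ "2. The minimum number of clicked articles per user is 2 in the training set and 1 in the test set\n",
+ "3. Users do click the same article more than once, but only in the training set\n",
+ "4. A single user's click environment is not always unique, so statistical features are a reasonable choice for this part\n",
+ "5. Click counts vary widely across users, which can be turned into a user-activity feature\n",
+ "6. Click counts also vary widely across articles, which can be turned into an article-popularity feature\n",
+ "7. The news a user reads is strongly related, so whether a user is interested in an article depends to a large extent on the articles they clicked before\n",
+ "8. The word counts of clicked articles differ considerably across users, which reflects each user's preference for article length\n",
+ "9. The topics of clicked articles also differ considerably across users, which reflects each user's topic preference\n",
+ "10. The time gaps between clicks differ across users, which reflects each user's preference for article freshness\n",
+ "\n",
+ "These findings should help with the feature engineering that follows and with extracting the information hidden in the data."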
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
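+ "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and taking responsibility. Datawhale also applies open-source ideas to open-source content, open-source learning, and open-source solutions, supporting talent development and growth and building connections between people, between people and knowledge, between people and companies, and between people and the future. The material for this data mining learning path will be shared on Tianchi; follow Datawhale for details:\n",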
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- ],
- "source": [
- "# 随机选择5个用户,查看这些用户前后查看文章的相似性\n",
- "sub_user_ids = np.random.choice(user_click_merge.user_id.unique(), size=15, replace=False)\n",
- "sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]\n",
- "\n",
- "sub_user_info.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {},
- "outputs": [],
- "source": [
- "# 上一个版本,这个函数使用的是赛题提供的词向量,但是由于给出的embedding并不是所有的数据的embedding,所以运行下面画图函数的时候会报keyerror的错误\n",
- "# 为了防止出现这个错误,这里修改为使用word2vec训练得到的词向量进行可视化\n",
- "def get_item_sim_list(df):\n",
- " sim_list = []\n",
- " item_list = df['click_article_id'].values\n",
- " for i in range(0, len(item_list)-1):\n",
- " emb1 = item_w2v_emb_dict[str(item_list[i])] # 需要注意的是word2vec训练时候使用的是str类型的数据\n",
- " emb2 = item_w2v_emb_dict[str(item_list[i+1])]\n",
- " sim_list.append(np.dot(emb1,emb2)/(np.linalg.norm(emb1)*(np.linalg.norm(emb2))))\n",
- " sim_list.append(0)\n",
- " return sim_list"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 46,
- "metadata": {},
- "outputs": [
- {
- "data": {
- "image/png": "\n",
- "text/plain": [
- ""
- ]
- },
- "metadata": {
- "needs_background": "light"
- },
- "output_type": "display_data"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Keras Code",
+ "language": "python",
+ "name": "dswipython"
+ },
+ "language_info": {
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130008",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "278px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- ],
- "source": [
- "for _, user_df in sub_user_info.groupby('user_id'):\n",
- " item_sim_list = get_item_sim_list(user_df)\n",
- " plt.plot(item_sim_list)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "这里由于对词向量的训练迭代次数不是很多,所以看到的可视化结果不是很准确,可以训练更多次来观察具体的现象。"
- ]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 总结\n",
- "\n",
- "通过数据分析的过程, 我们目前可以得到以下几点重要的信息, 这个对于我们进行后面的特征制作和分析非常有帮助:\n",
- "1. 训练集和测试集的用户id没有重复,也就是测试集里面的用户模型是没有见过的\n",
- "2. 训练集中用户最少的点击文章数是2, 而测试集里面用户最少的点击文章数是1\n",
- "3. 用户对于文章存在重复点击的情况, 但这个都存在于训练集里面\n",
- "4. 同一用户的点击环境存在不唯一的情况,后面做这部分特征的时候可以采用统计特征\n",
- "5. 用户点击文章的次数有很大的区分度,后面可以根据这个制作衡量用户活跃度的特征\n",
- "6. 文章被用户点击的次数也有很大的区分度,后面可以根据这个制作衡量文章热度的特征\n",
- "7. 用户看的新闻,相关性是比较强的,所以往往我们判断用户是否对某篇文章感兴趣的时候, 在很大程度上会和他历史点击过的文章有关\n",
- "8. 用户点击的文章字数有比较大的区别, 这个可以反映用户对于文章字数的区别\n",
- "9. 用户点击过的文章主题也有很大的区别, 这个可以反映用户的主题偏好\n",
- "10.不同用户点击文章的时间差也会有所区别, 这个可以反映用户对于文章时效性的偏好\n",
- "\n",
- "所以根据上面的一些分析,可以更好的帮助我们后面做好特征工程, 充分挖掘数据的隐含信息。"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "关于Datawhale: Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Keras Code",
- "language": "python",
- "name": "dswipython"
- },
- "language_info": {
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130008",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "278px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
- }
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.ipynb" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
index 3a4bccd4e..08bc05222 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.ipynb"
@@ -1,2107 +1,2107 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 多路召回\n",
- "\n",
- "所谓的“多路召回”策略,就是指采用不同的策略、特征或简单模型,分别召回一部分候选集,然后把候选集混合在一起供后续排序模型使用,可以明显的看出,“多路召回策略”是在“计算速度”和“召回率”之间进行权衡的结果。其中,各种简单策略保证候选集的快速召回,从不同角度设计的策略保证召回率接近理想的状态,不至于损伤排序效果。如下图是多路召回的一个示意图,在多路召回中,每个策略之间毫不相关,所以一般可以写并发多线程同时进行,这样可以更加高效。\n",
- "\n",
- " \n",
- "\n",
- "上图只是一个多路召回的例子,也就是说可以使用多种不同的策略来获取用户排序的候选商品集合,而具体使用哪些召回策略其实是与业务强相关的 ,针对不同的任务就会有对于该业务真实场景下需要考虑的召回规则。例如新闻推荐,召回规则可以是“热门新闻”、“作者召回”、“关键词召回”、“主题召回“、”协同过滤召回“等等。 \n",
- "\n"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 导包"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:29.834662Z",
- "start_time": "2020-11-16T11:26:27.811511Z"
- }
- },
- "outputs": [],
- "source": [
- "import pandas as pd \n",
- "import numpy as np\n",
- "from tqdm import tqdm \n",
- "from collections import defaultdict \n",
- "import os, math, warnings, math, pickle\n",
- "from tqdm import tqdm\n",
- "import faiss\n",
- "import collections\n",
- "import random\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "from sklearn.preprocessing import LabelEncoder\n",
- "from datetime import datetime\n",
- "from deepctr.feature_column import SparseFeat, VarLenSparseFeat\n",
- "from sklearn.preprocessing import LabelEncoder\n",
- "from tensorflow.python.keras import backend as K\n",
- "from tensorflow.python.keras.models import Model\n",
- "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
- "\n",
- "from deepmatch.models import *\n",
- "from deepmatch.utils import sampledsoftmaxloss\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:31.831215Z",
- "start_time": "2020-11-16T11:26:31.826939Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'\n",
- "# 做召回评估的一个标志, 如果不进行评估就是直接使用全量数据进行召回\n",
- "metric_recall = False"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取数据\n",
- "在一般的rs比赛中读取数据部分主要分为三种模式, 不同的模式对应的不同的数据集:\n",
- "1. debug模式: 这个的目的是帮助我们基于数据先搭建一个简易的baseline并跑通, 保证写的baseline代码没有什么问题。 由于推荐比赛的数据往往非常巨大, 如果一上来直接采用全部的数据进行分析,搭建baseline框架, 往往会带来时间和设备上的损耗, **所以这时候我们往往需要从海量数据的训练集中随机抽取一部分样本来进行调试(train_click_log_sample)**, 先跑通一个baseline。\n",
- "2. 线下验证模式: 这个的目的是帮助我们在线下基于已有的训练集数据, 来选择好合适的模型和一些超参数。 **所以我们这一块只需要加载整个训练集(train_click_log)**, 然后把整个训练集再分成训练集和验证集。 训练集是模型的训练数据, 验证集部分帮助我们调整模型的参数和其他的一些超参数。\n",
- "3. 线上模式: 我们用debug模式搭建起一个推荐系统比赛的baseline, 用线下验证模式选择好了模型和一些超参数, 这一部分就是真正的对于给定的测试集进行预测, 提交到线上, **所以这一块使用的训练数据集是全量的数据集(train_click_log+test_click_log)**\n",
- "\n",
- "下面就分别对这三种不同的数据读取模式先建立不同的代导入函数, 方便后面针对不同的模式下导入数据。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:34.476240Z",
- "start_time": "2020-11-16T11:26:34.467352Z"
- }
- },
- "outputs": [],
- "source": [
- "# debug模式: 从训练集中划出一部分数据来调试代码\n",
- "def get_all_click_sample(data_path, sample_nums=10000):\n",
- " \"\"\"\n",
- " 训练集中采样一部分数据调试\n",
- " data_path: 原数据的存储路径\n",
- " sample_nums: 采样数目(这里由于机器的内存限制,可以采样用户做)\n",
- " \"\"\"\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " all_user_ids = all_click.user_id.unique()\n",
- "\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
- " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click\n",
- "\n",
- "# 读取点击数据,这里分成线上和线下,如果是为了获取线上提交结果应该讲测试集中的点击数据合并到总的数据中\n",
- "# 如果是为了线下验证模型的有效性或者特征的有效性,可以只使用训练集\n",
- "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
- " if offline:\n",
- " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " else:\n",
- " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
- " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
- "\n",
- " all_click = trn_click.append(tst_click)\n",
- " \n",
- " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
- " return all_click"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:35.168738Z",
- "start_time": "2020-11-16T11:26:35.163210Z"
- }
- },
- "outputs": [],
- "source": [
- "# 读取文章的基本属性\n",
- "def get_item_info_df(data_path):\n",
- " item_info_df = pd.read_csv(data_path + 'articles.csv')\n",
- " \n",
- " # 为了方便与训练集中的click_article_id拼接,需要把article_id修改成click_article_id\n",
- " item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'})\n",
- " \n",
- " return item_info_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:36.152958Z",
- "start_time": "2020-11-16T11:26:36.146324Z"
- }
- },
- "outputs": [],
- "source": [
- "# 读取文章的Embedding数据\n",
- "def get_item_emb_dict(data_path):\n",
- " item_emb_df = pd.read_csv(data_path + 'articles_emb.csv')\n",
- " \n",
- " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
- " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols])\n",
- " # 进行归一化\n",
- " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
- "\n",
- " item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np))\n",
- " pickle.dump(item_emb_dict, open(save_path + 'item_content_emb.pkl', 'wb'))\n",
- " \n",
- " return item_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:37.333536Z",
- "start_time": "2020-11-16T11:26:37.329545Z"
- }
- },
- "outputs": [],
- "source": [
- "max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:42.163494Z",
- "start_time": "2020-11-16T11:26:38.018094Z"
- }
- },
- "outputs": [],
- "source": [
- "# 采样数据\n",
- "# all_click_df = get_all_click_sample(data_path)\n",
- "\n",
- "# 全量训练集\n",
- "all_click_df = get_all_click_df(offline=False)\n",
- "\n",
- "# 对时间戳进行归一化,用于在关联规则的时候计算权重\n",
- "all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:26:44.343500Z",
- "start_time": "2020-11-16T11:26:44.113891Z"
- }
- },
- "outputs": [],
- "source": [
- "item_info_df = get_item_info_df(data_path)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:24.295343Z",
- "start_time": "2020-11-16T11:26:44.398007Z"
- }
- },
- "outputs": [],
- "source": [
- "item_emb_dict = get_item_emb_dict(data_path)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 工具函数"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 获取用户-文章-时间函数\n",
- "这个在基于关联规则的用户协同过滤的时候会用到"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:33.791656Z",
- "start_time": "2020-11-16T11:27:33.784305Z"
- }
- },
- "outputs": [],
- "source": [
- "# 根据点击时间获取用户的点击文章序列 {user1: [(item1, time1), (item2, time2)..]...}\n",
- "def get_user_item_time(click_df):\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " \n",
- " def make_item_time_pair(df):\n",
- " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
- " \n",
- " user_item_time_df = click_df.groupby('user_id')['click_article_id', 'click_timestamp'].apply(lambda x: make_item_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'item_time_list'})\n",
- " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
- " \n",
- " return user_item_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 获取文章-用户-时间函数\n",
- "这个在基于关联规则的文章协同过滤的时候会用到"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:38.327581Z",
- "start_time": "2020-11-16T11:27:38.321059Z"
- }
- },
- "outputs": [],
- "source": [
- "# 根据时间获取商品被点击的用户序列 {item1: [(user1, time1), (user2, time2)...]...}\n",
- "# 这里的时间是用户点击当前商品的时间,好像没有直接的关系。\n",
- "def get_item_user_time_dict(click_df):\n",
- " def make_user_time_pair(df):\n",
- " return list(zip(df['user_id'], df['click_timestamp']))\n",
- " \n",
- " click_df = click_df.sort_values('click_timestamp')\n",
- " item_user_time_df = click_df.groupby('click_article_id')['user_id', 'click_timestamp'].apply(lambda x: make_user_time_pair(x))\\\n",
- " .reset_index().rename(columns={0: 'user_time_list'})\n",
- " \n",
- " item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))\n",
- " return item_user_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 获取历史和最后一次点击\n",
- "这个在评估召回结果, 特征工程和制作标签转成监督学习测试集的时候回用到"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:50.894683Z",
- "start_time": "2020-11-16T11:27:50.888002Z"
- }
- },
- "outputs": [],
- "source": [
- "# 获取当前数据的历史点击和最后一次点击\n",
- "def get_hist_and_last_click(all_click):\n",
- " \n",
- " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
- " click_last_df = all_click.groupby('user_id').tail(1)\n",
- "\n",
- " # 如果用户只有一个点击,hist为空了,会导致训练的时候这个用户不可见,此时默认泄露一下\n",
- " def hist_func(user_df):\n",
- " if len(user_df) == 1:\n",
- " return user_df\n",
- " else:\n",
- " return user_df[:-1]\n",
- "\n",
- " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
- "\n",
- " return click_hist_df, click_last_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 获取文章属性特征"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:55.893810Z",
- "start_time": "2020-11-16T11:27:55.887623Z"
- }
- },
- "outputs": [],
- "source": [
- "# 获取文章id对应的基本属性,保存成字典的形式,方便后面召回阶段,冷启动阶段直接使用\n",
- "def get_item_info_dict(item_info_df):\n",
- " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
- " item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler)\n",
- " \n",
- " item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id']))\n",
- " item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count']))\n",
- " item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts']))\n",
- " \n",
- " return item_type_dict, item_words_dict, item_created_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-13T06:42:38.730939Z",
- "start_time": "2020-11-13T06:42:38.728461Z"
- }
- },
- "source": [
- "### 获取用户历史点击的文章信息"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:27:59.650781Z",
- "start_time": "2020-11-16T11:27:59.640572Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_hist_item_info_dict(all_click):\n",
- " \n",
- " # 获取user_id对应的用户历史点击文章类型的集合字典\n",
- " user_hist_item_typs = all_click.groupby('user_id')['category_id'].agg(set).reset_index()\n",
- " user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id']))\n",
- " \n",
- " # 获取user_id对应的用户点击文章的集合\n",
- " user_hist_item_ids_dict = all_click.groupby('user_id')['click_article_id'].agg(set).reset_index()\n",
- " user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id']))\n",
- " \n",
- " # 获取user_id对应的用户历史点击的文章的平均字数字典\n",
- " user_hist_item_words = all_click.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
- " user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count']))\n",
- " \n",
- " # 获取user_id对应的用户最后一次点击的文章的创建时间\n",
- " all_click_ = all_click.sort_values('click_timestamp')\n",
- " user_last_item_created_time = all_click_.groupby('user_id')['created_at_ts'].apply(lambda x: x.iloc[-1]).reset_index()\n",
- " \n",
- " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
- " user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler)\n",
- " \n",
- " user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'], \\\n",
- " user_last_item_created_time['created_at_ts']))\n",
- " \n",
- " return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 获取点击次数最多的topk个文章"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:04.761105Z",
- "start_time": "2020-11-16T11:28:04.756419Z"
- }
- },
- "outputs": [],
- "source": [
- "# 获取近期点击最多的文章\n",
- "def get_item_topk_click(click_df, k):\n",
- " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
- " return topk_click"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 定义多路召回字典"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:08.321506Z",
- "start_time": "2020-11-16T11:28:07.623281Z"
- }
- },
- "outputs": [],
- "source": [
- "# 获取文章的属性信息,保存成字典的形式方便查询\n",
- "item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:28:13.791569Z",
- "start_time": "2020-11-16T11:28:13.786522Z"
- }
- },
- "outputs": [],
- "source": [
- "# 定义一个多路召回的字典,将各路召回的结果都保存在这个字典当中\n",
- "user_multi_recall_dict = {'itemcf_sim_itemcf_recall': {},\n",
- " 'embedding_sim_item_recall': {},\n",
- " 'youtubednn_recall': {},\n",
- " 'youtubednn_usercf_recall': {}, \n",
- " 'cold_start_recall': {}}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T05:41:12.710754Z",
- "start_time": "2020-11-16T05:40:57.842614Z"
- }
- },
- "outputs": [],
- "source": [
- "# 提取最后一次点击作为召回评估,如果不需要做召回评估直接使用全量的训练集进行召回(线下验证模型)\n",
- "# 如果不是召回评估,直接使用全量数据进行召回,不用将最后一次提取出来\n",
- "trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 召回效果评估函数\n",
- "做完了召回有时候也需要对当前的召回方法或者参数进行调整以达到更好的召回效果,因为召回的结果决定了最终排序的上限,下面也会提供一个召回评估的方法"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T05:41:18.579118Z",
- "start_time": "2020-11-16T05:41:18.571887Z"
- }
- },
- "outputs": [],
- "source": [
- "# 依次评估召回的前10, 20, 30, 40, 50个文章中的击中率\n",
- "def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=5):\n",
- " last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))\n",
- " user_num = len(user_recall_items_dict)\n",
- " \n",
- " for k in range(10, topk+1, 10):\n",
- " hit_num = 0\n",
- " for user, item_list in user_recall_items_dict.items():\n",
- " # 获取前k个召回的结果\n",
- " tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]\n",
- " if last_click_item_dict[user] in set(tmp_recall_items):\n",
- " hit_num += 1\n",
- " \n",
- " hit_rate = round(hit_num * 1.0 / user_num, 5)\n",
- " print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 计算相似性矩阵\n",
- "\n",
- "这一部分主要是通过协同过滤以及向量检索得到相似性矩阵,相似性矩阵主要分为user2user和item2item,下面依次获取基于itemcf的item2item的相似性矩阵,"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### itemcf i2i_sim\n",
- "\n",
- "借鉴KDD2020的去偏商品推荐,在计算item2item相似性矩阵时,使用关联规则,使得计算的文章的相似性还考虑到了:\n",
- "1. 用户点击的时间权重\n",
- "2. 用户点击的顺序权重\n",
- "3. 文章创建的时间权重"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:30:51.872262Z",
- "start_time": "2020-11-16T11:30:51.860099Z"
- }
- },
- "outputs": [],
- "source": [
- "def itemcf_sim(df, item_created_time_dict):\n",
- " \"\"\"\n",
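-    "        Compute the article-to-article similarity matrix\n",
-    "        :param df: click log DataFrame\n",
-    "        :param item_created_time_dict: dict of article creation times\n",
-    "        return : article-to-article similarity matrix\n",
-    "        \n",
-    "        Idea: item-based collaborative filtering (see the previous team-study session on recommender system basics for details) + association rules\n",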
- " \"\"\"\n",
- " \n",
- " user_item_time_dict = get_user_item_time(df)\n",
- " \n",
-    "    # Compute item similarity\n",
- " i2i_sim = {}\n",
- " item_cnt = defaultdict(int)\n",
- " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
-    "        # Time factors can be taken into account when refining item-based collaborative filtering\n",
- " for loc1, (i, i_click_time) in enumerate(item_time_list):\n",
- " item_cnt[i] += 1\n",
- " i2i_sim.setdefault(i, {})\n",
- " for loc2, (j, j_click_time) in enumerate(item_time_list):\n",
- " if(i == j):\n",
- " continue\n",
- " \n",
-    "                # Distinguish forward-order clicks from reverse-order clicks\n",
-    "                loc_alpha = 1.0 if loc2 > loc1 else 0.7\n",
-    "                # Position weight; the constants are tunable\n",
-    "                loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))\n",
-    "                # Click-time weight; the constants are tunable\n",
-    "                click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))\n",
-    "                # Weight from the two articles' creation times; the constants are tunable\n",
-    "                created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
-    "                i2i_sim[i].setdefault(j, 0)\n",
-    "                # Combine all the weights into the final article-to-article similarity\n",
- " i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)\n",
- " \n",
- " i2i_sim_ = i2i_sim.copy()\n",
- " for i, related_items in i2i_sim.items():\n",
- " for j, wij in related_items.items():\n",
- " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
- " \n",
-    "    # Save the similarity matrix to disk\n",
- " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
- " \n",
- " return i2i_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:47:09.937002Z",
- "start_time": "2020-11-16T11:30:57.394334Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [14:20<00:00, 290.38it/s]\n"
- ]
- }
- ],
- "source": [
- "i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### usercf u2u_sim\n",
- "\n",
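-    "When computing similarity between users, simple association rules can also be applied, such as a user-activity weight. Here the number of clicks per user serves as the activity measure."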
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:14.951940Z",
- "start_time": "2020-11-16T09:11:14.945654Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_activate_degree_dict(all_click_df):\n",
- " all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index()\n",
- " \n",
-    "    # Normalize user activity with min-max scaling\n",
- " mm = MinMaxScaler()\n",
- " all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']])\n",
- " user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id']))\n",
- " \n",
- " return user_activate_degree_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:19.879276Z",
- "start_time": "2020-11-16T09:11:19.868808Z"
- }
- },
- "outputs": [],
- "source": [
- "def usercf_sim(all_click_df, user_activate_degree_dict):\n",
- " \"\"\"\n",
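-    "        Compute the user similarity matrix\n",
-    "        :param all_click_df: click log DataFrame\n",
-    "        :param user_activate_degree_dict: dict of user activity levels\n",
-    "        return: user similarity matrix\n",
-    "        \n",
-    "        Idea: user-based collaborative filtering (see the previous team-study session on recommender system basics for details) + association rules\n",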
- " \"\"\"\n",
- " item_user_time_dict = get_item_user_time_dict(all_click_df)\n",
- " \n",
- " u2u_sim = {}\n",
- " user_cnt = defaultdict(int)\n",
- " for item, user_time_list in tqdm(item_user_time_dict.items()):\n",
- " for u, click_time in user_time_list:\n",
- " user_cnt[u] += 1\n",
- " u2u_sim.setdefault(u, {})\n",
- " for v, click_time in user_time_list:\n",
- " u2u_sim[u].setdefault(v, 0)\n",
- " if u == v:\n",
- " continue\n",
-    "                # Use the two users' mean activity as the activity weight; this formula can be refined\n",
- " activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v]) \n",
- " u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1)\n",
- " \n",
- " u2u_sim_ = u2u_sim.copy()\n",
- " for u, related_users in u2u_sim.items():\n",
- " for v, wij in related_users.items():\n",
- " u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v])\n",
- " \n",
-    "    # Save the similarity matrix to disk\n",
- " pickle.dump(u2u_sim_, open(save_path + 'usercf_u2u_sim.pkl', 'wb'))\n",
- "\n",
- " return u2u_sim_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T06:59:46.701572Z",
- "start_time": "2020-11-16T06:59:26.852246Z"
- }
- },
- "outputs": [],
- "source": [
-    "# usercf is too memory-hungry on the full data, so it is not run directly here\n",
-    "# It does run on a sampled subset\n",
- "user_activate_degree_dict = get_user_activate_degree_dict(all_click_df)\n",
- "u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### item embedding sim\n",
- "\n",
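-    "Embedding-based item similarity is computed so that, at the later cold-start stage, articles that never appear in the click data can still be retrieved. Cold start has its own section later; here is a brief introduction to faiss.\n",
-    "\n",
-    "Faiss is an open-source library from Facebook's AI team for clustering and similarity search, with its core implemented in C++. Thanks to its strong performance, it is widely used in recommendation-related services.\n",
-    "\n",
-    "Faiss is typically used in the vector-recall part of a recommender system. Vector recall is u2u, u2i, or i2i, where u and i stand for user and item. In real scenarios both users and items are massive in number. The most obvious vector-similarity recall is a double loop over the user list or item list that computes pairwise similarity, but this is impractical at scale. Faiss speeds up finding the topk indexed vectors most similar to a query vector.\n",
-    "\n",
-    "**How faiss search works:**\n",
-    "\n",
-    "Faiss uses PCA and PQ (product quantization) to compress and encode vectors. Other optimizations are used as well, but PCA and PQ are the core.\n",
-    "\n",
-    "1. For details of the PCA dimensionality-reduction algorithm, see  \n",
-    "[Summary of the principles of principal component analysis (PCA)](https://www.cnblogs.com/pinard/p/6239403.html)  \n",
-    "\n",
-    "2. For details of PQ encoding, see  \n",
-    "[Understanding the product quantization algorithm by example](http://www.fabwrite.com/productquantization)\n",
-    "\n",
-    "**Using faiss**\n",
-    "\n",
-    "[Official faiss tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)"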
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:11:28.631803Z",
- "start_time": "2020-11-16T09:11:28.619926Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Similarity via vector retrieval\n",
-    "# topk: the number of most similar items faiss returns for each item\n",
- "def embdding_sim(click_df, item_emb_df, save_path, topk):\n",
- " \"\"\"\n",
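-    "        Compute the content-based article embedding similarity matrix\n",
-    "        :param click_df: click log DataFrame\n",
-    "        :param item_emb_df: article embeddings\n",
-    "        :param save_path: output directory\n",
-    "        :param topk: number of most similar articles to retrieve\n",
-    "        return: article similarity matrix\n",
-    "        \n",
-    "        Idea: for each article, return the topk most similar articles by embedding similarity; faiss is used for speed because there are so many articles\n",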
- " \"\"\"\n",
- " \n",
-    "    # Map article row index to raw article id\n",
- " item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id']))\n",
- " \n",
- " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
- " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32)\n",
-    "    # Normalize the vectors to unit length\n",
- " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
- " \n",
-    "    # Build the faiss index\n",
- " item_index = faiss.IndexFlatIP(item_emb_np.shape[1])\n",
- " item_index.add(item_emb_np)\n",
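-    "    # Similarity search: for the vector at each index position, return the topk items and their similarities\n",
-    "    sim, idx = item_index.search(item_emb_np, topk) # returns the similarities and the matching index positions\n",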
- " \n",
-    "    # Convert the retrieval results back to raw ids\n",
- " item_sim_dict = collections.defaultdict(dict)\n",
- " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)):\n",
- " target_raw_id = item_idx_2_rawid_dict[target_idx]\n",
-    "        # Start from 1 to drop the item itself, so each item ends up with only topk-1 similar items\n",
- " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
- " rele_raw_id = item_idx_2_rawid_dict[rele_idx]\n",
- " item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
- " \n",
-    "    # Save the i2i similarity matrix\n",
- " pickle.dump(item_sim_dict, open(save_path + 'emb_i2i_sim.pkl', 'wb')) \n",
- " \n",
- " return item_sim_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T09:32:35.926116Z",
- "start_time": "2020-11-16T09:11:44.586967Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "364047it [00:23, 15292.14it/s]\n"
- ]
- }
- ],
- "source": [
- "item_emb_df = pd.read_csv(data_path + '/articles_emb.csv')\n",
- "emb_i2i_sim = embdding_sim(all_click_df, item_emb_df, save_path, topk=10) # topk可以自行设置"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
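-    "## Recall\n",
-    "This is the problem raised at the start: faced with 360,000 articles and more than 200,000 users, what strategies reduce the size of the problem? In the recall stage we select a candidate set of articles each user might click, which shrinks the problem. Common recall strategies:\n",
-    "* YouTube DNN recall\n",
-    "* Article-based recall\n",
-    "    * article collaborative filtering\n",
-    "    * article-embedding-based recall\n",
-    "* User-based recall\n",
-    "    * user collaborative filtering\n",
-    "    * user embedding\n",
-    "\n",
-    "Some of the recall methods above start from the articles a user has already read and recall articles similar to them. Different ways of computing that similarity give different recall methods, such as article collaborative filtering and article content embeddings. Others recommend by user similarity: a user is recommended the articles read by similar users, as in user collaborative filtering and user embeddings. A third idea resembles matrix factorization: once user and article embeddings are computed, user-article similarity can be computed directly and used for recommendation, as in YouTube DNN. Each recall method is covered in detail below:"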
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "### YoutubeDNN recall\n",
-    "**(This step directly produces each user's list of recalled candidate articles.)**\n",
-    "\n",
-    "[Paper download link](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)\n",
-    "\n",
-    "**YoutubeDNN recall architecture**\n",
- "\n",
- "![image-20201111160516562](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201111160516562.png)\n",
- "\n",
- "\n",
- "\n",
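-    "For the principles and applications of YoutubeDNN, these two blog posts by Wang Zhe are recommended:\n",
-    "\n",
-    "1. [Rereading the YouTube deep learning recommender system paper: every word a gem, a masterpiece](https://zhuanlan.zhihu.com/p/52169807)\n",
-    "2. [Ten engineering questions about the YouTube deep learning recommender system](https://zhuanlan.zhihu.com/p/52504407)\n",
-    "\n",
-    "\n",
-    "**References:**\n",
-    "1. https://zhuanlan.zhihu.com/p/52169807 (principles of YouTubeDNN)\n",
-    "2. https://zhuanlan.zhihu.com/p/26306795 (a highly upvoted Zhihu article on Word2Vec) --- word2vec is covered in the w2v introduction of the ranking section\n"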
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:13:11.058766Z",
- "start_time": "2020-11-16T10:13:11.041084Z"
- }
- },
- "outputs": [],
- "source": [
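-    "# Build the training and validation data for two-tower recall\n",
-    "# negsample: number of negative samples per positive when building samples with a sliding window\n",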
- "def gen_data_set(data, negsample=0):\n",
- " data.sort_values(\"click_timestamp\", inplace=True)\n",
- " item_ids = data['click_article_id'].unique()\n",
- "\n",
- " train_set = []\n",
- " test_set = []\n",
- " for reviewerID, hist in tqdm(data.groupby('user_id')):\n",
- " pos_list = hist['click_article_id'].tolist()\n",
- " \n",
- " if negsample > 0:\n",
-    "            candidate_set = list(set(item_ids) - set(pos_list))   # draw negatives from articles the user has not seen\n",
-    "            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  # n negatives for each positive\n",
-    "            \n",
-    "        # A user with a single click must also go into the training set; otherwise the learned embeddings would be missing that user\n",
- " if len(pos_list) == 1:\n",
- " train_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
- " test_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
- " \n",
-    "        # Build positive and negative samples with a sliding window\n",
- " for i in range(1, len(pos_list)):\n",
- " hist = pos_list[:i]\n",
- " \n",
- " if i != len(pos_list) - 1:\n",
-    "                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  # positive sample [user_id, his_item, pos_item, label, len(his_item)]\n",
-    "                for negi in range(negsample):\n",
-    "                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0,len(hist[::-1]))) # negative sample [user_id, his_item, neg_item, label, len(his_item)]\n",
-    "            else:\n",
-    "                # The longest history, ending at the last click, is used as test data\n",
- " test_set.append((reviewerID, hist[::-1], pos_list[i],1,len(hist[::-1])))\n",
- " \n",
- " random.shuffle(train_set)\n",
- " random.shuffle(test_set)\n",
- " \n",
- " return train_set, test_set\n",
- "\n",
-    "# Pad the inputs so that all sequence features have the same length\n",
- "def gen_model_input(train_set,user_profile,seq_max_len):\n",
- "\n",
- " train_uid = np.array([line[0] for line in train_set])\n",
- " train_seq = [line[1] for line in train_set]\n",
- " train_iid = np.array([line[2] for line in train_set])\n",
- " train_label = np.array([line[3] for line in train_set])\n",
- " train_hist_len = np.array([line[4] for line in train_set])\n",
- "\n",
- " train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)\n",
- " train_model_input = {\"user_id\": train_uid, \"click_article_id\": train_iid, \"hist_article_id\": train_seq_pad,\n",
- " \"hist_len\": train_hist_len}\n",
- "\n",
- " return train_model_input, train_label"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:13:18.124452Z",
- "start_time": "2020-11-16T10:13:18.098284Z"
- }
- },
- "outputs": [],
- "source": [
- "def youtubednn_u2i_dict(data, topk=20): \n",
- " sparse_features = [\"click_article_id\", \"user_id\"]\n",
-    "    SEQ_LEN = 30 # length of the user click sequence: shorter ones are padded, longer ones truncated\n",
- " \n",
- " user_profile_ = data[[\"user_id\"]].drop_duplicates('user_id')\n",
- " item_profile_ = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
- " \n",
-    "    # Label-encode the categorical features\n",
- " features = [\"click_article_id\", \"user_id\"]\n",
- " feature_max_idx = {}\n",
- " \n",
- " for feature in features:\n",
- " lbe = LabelEncoder()\n",
- " data[feature] = lbe.fit_transform(data[feature])\n",
- " feature_max_idx[feature] = data[feature].max() + 1\n",
- " \n",
-    "    # Extract user and item profiles; which features to use needs further analysis\n",
- " user_profile = data[[\"user_id\"]].drop_duplicates('user_id')\n",
- " item_profile = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
- " \n",
- " user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))\n",
- " item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))\n",
- " \n",
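-    "    # Split into training and test sets\n",
-    "    # Deep learning usually needs a lot of data, so a sliding window is often used to enlarge the training set and keep recall quality up\n",
-    "    train_set, test_set = gen_data_set(data, 0)\n",
-    "    # Assemble the model inputs; see the functions above for details\n",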
- " train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)\n",
- " test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)\n",
- " \n",
-    "    # Embedding dimension\n",
- " embedding_dim = 16\n",
- " \n",
-    "    # Arrange the data into the form the model takes as input\n",
- " user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),\n",
- " VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,\n",
- " embedding_name=\"click_article_id\"), SEQ_LEN, 'mean', 'hist_len'),]\n",
- " item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]\n",
- " \n",
-    "    # Model definition \n",
-    "    # num_sampled: number of samples for negative sampling\n",
-    "    model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))\n",
-    "    # Compile the model\n",
- " model.compile(optimizer=\"adam\", loss=sampledsoftmaxloss) \n",
- " \n",
-    "    # Train the model; validation_split sets the validation fraction, and 0 trains on all the data\n",
- " history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)\n",
- " \n",
-    "    # After training, extract the learned embeddings on both the user side and the item side\n",
- " test_user_model_input = test_model_input\n",
- " all_item_model_input = {\"click_article_id\": item_profile['click_article_id'].values}\n",
- "\n",
- " user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)\n",
- " item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)\n",
- " \n",
-    "    # Save item_embedding and user_embedding; they may be useful for ranking, but they must stay aligned with the raw ids when saved\n",
- " user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)\n",
- " item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)\n",
- " \n",
-    "    # Normalize the embeddings before saving\n",
- " user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)\n",
- " item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)\n",
- " \n",
-    "    # Convert the embeddings to dicts for easy lookup\n",
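-    "    # user_embs rows follow the shuffled test-set order, so pair them with the test-set user ids\n",
-    "    raw_user_id_emb_dict = {user_index_2_rawid[k]: \\\n",
-    "                                v for k, v in zip(test_user_model_input['user_id'], user_embs)}\n",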
- " raw_item_id_emb_dict = {item_index_2_rawid[k]: \\\n",
- " v for k, v in zip(item_profile['click_article_id'], item_embs)}\n",
-    "    # Save the embeddings to disk\n",
- " pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))\n",
- " pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))\n",
- " \n",
-    "    # faiss nearest-neighbor search: use user_embedding to find the topk most similar items\n",
- " index = faiss.IndexFlatIP(embedding_dim)\n",
-    "    # Already normalized above, so there is no need to normalize again here\n",
- "# faiss.normalize_L2(user_embs)\n",
- "# faiss.normalize_L2(item_embs)\n",
-    "    index.add(item_embs) # build the index from the item vectors\n",
-    "    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # query the topk most similar items for each user\n",
- " \n",
- " user_recall_items_dict = collections.defaultdict(dict)\n",
- " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):\n",
- " target_raw_id = user_index_2_rawid[target_idx]\n",
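-    "        # This is a user-to-item search, so the query is not among the results; keep all topk hits\n",
-    "        for rele_idx, sim_value in zip(rele_idx_list, sim_value_list): \n",
-    "            # faiss returns row positions in item_embs; map position -> encoded id -> raw id\n",
-    "            rele_raw_id = item_index_2_rawid[item_profile['click_article_id'].values[rele_idx]]\n",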
- " user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\\\n",
- " .get(rele_raw_id, 0) + sim_value\n",
- " \n",
- " user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}\n",
-    "    # The recall results above are sorted by similarity in descending order\n",
- " \n",
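-    "    # Save the recall results\n",
-    "    # Here the recall results come directly from vector search. The methods above only produce i2i and u2u similarity matrices, which still need a collaborative-filtering recall step to yield results\n",
-    "    # These results can be evaluated directly; for convenience, one shared evaluation function can cover all recall results\n",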
- " pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl', 'wb'))\n",
- " return user_recall_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T10:21:46.420014Z",
- "start_time": "2020-11-16T10:13:35.351131Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [02:02<00:00, 2038.57it/s]\n"
- ]
- },
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "keep_dims is deprecated, use keepdims instead\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Deprecated in favor of operator or tf.math.divide.\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "1149673/1149673 [==============================] - 216s 188us/sample - loss: 0.1326\n"
- ]
- },
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "250000it [00:32, 7720.75it/s]\n"
- ]
- }
- ],
- "source": [
-    "# Recall evaluation is needed here, so the last click of each user in the training set is held out\n",
- "if not metric_recall:\n",
- " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)\n",
- "else:\n",
- " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)\n",
-    "    # Evaluate recall\n",
- " metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### itemcf recall\n",
- "\n",
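-    "Collaborative filtering and embedding retrieval have produced article similarity matrices above. Next, following the collaborative-filtering idea, we recall for each user the articles similar to those in their click history.\n",
-    "Association rules are used at recall time as well:\n",
-    "1. Weight by the position of the historical article in the click sequence (see the code for details)\n",
-    "2. Weight by article creation time, that is, the creation-time gap between the similar article and the historical article\n",
-    "3. Weight by content similarity (computed from embeddings; note that the embedding step does not compute similarity for every pair of items, so a similar article may have no embedding similarity with the historical article and needs special handling)"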
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T11:48:40.580553Z",
- "start_time": "2020-11-16T11:48:40.567130Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Item-based recall (i2i)\n",
- "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
- " \"\"\"\n",
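-    "        Recall based on article collaborative filtering\n",
-    "        :param user_id: user id\n",
-    "        :param user_item_time_dict: dict, each user's clicked-article sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
-    "        :param i2i_sim: dict, article similarity matrix\n",
-    "        :param sim_item_topk: int, number of most similar articles to take for each history article\n",
-    "        :param recall_item_num: int, number of articles to recall in the end\n",
-    "        :param item_topk_click: list, most-clicked articles, used to pad the recall list\n",
-    "        :param item_created_time_dict: dict, article creation times\n",
-    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",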
- " \n",
- " return: 召回的文章列表 [(item1, score1), (item2, score2)...]\n",
- " \"\"\"\n",
-    "    # Articles the user has interacted with\n",
-    "    user_hist_items = user_item_time_dict[user_id]\n",
-    "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
- " \n",
- " item_rank = {}\n",
- " for loc, (i, click_time) in enumerate(user_hist_items):\n",
- " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
- " if j in user_hist_items_:\n",
- " continue\n",
- " \n",
-    "            # Weight from the creation-time gap between the two articles\n",
- " created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
-    "            # Position weight of the history article within the user's click sequence\n",
- " loc_weight = (0.9 ** (len(user_hist_items) - loc))\n",
- " \n",
- " content_weight = 1.0\n",
- " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
- " content_weight += emb_i2i_sim[i][j]\n",
- " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
- " content_weight += emb_i2i_sim[j][i]\n",
- " \n",
- " item_rank.setdefault(j, 0)\n",
- " item_rank[j] += created_time_weight * loc_weight * content_weight * wij\n",
- " \n",
-    "    # Fewer than recall_item_num results: pad with popular items\n",
-    "    if len(item_rank) < recall_item_num:\n",
-    "        for i, item in enumerate(item_topk_click):\n",
-    "            if item in item_rank: # a padded item must not already be in the list\n",
-    "                continue\n",
-    "            item_rank[item] = - i - 100 # any negative score will do\n",
- " if len(item_rank) == recall_item_num:\n",
- " break\n",
- " \n",
- " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
- " \n",
- " return item_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### itemcf sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T14:41:23.433038Z",
- "start_time": "2020-11-16T11:48:46.286350Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [2:51:13<00:00, 24.33it/s] \n"
- ]
- }
- ],
- "source": [
-    "# Run itemcf recall first; the last click is held out for recall evaluation\n",
- "\n",
- "if metric_recall:\n",
- " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- " trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "\n",
- "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
- "emb_i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))\n",
- "\n",
- "sim_item_topk = 20\n",
- "recall_item_num = 10\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, \\\n",
- " i2i_sim, sim_item_topk, recall_item_num, \\\n",
- " item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- "\n",
- "user_multi_recall_dict['itemcf_sim_itemcf_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['itemcf_sim_itemcf_recall'], open(save_path + 'itemcf_recall_dict.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
-    "    # Evaluate recall\n",
- " metrics_recall(user_multi_recall_dict['itemcf_sim_itemcf_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### embedding sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T15:04:51.527795Z",
- "start_time": "2020-11-16T14:59:03.907519Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [04:35<00:00, 905.85it/s] \n"
- ]
- }
- ],
- "source": [
-    "# The last click is held out for recall evaluation\n",
- "if metric_recall:\n",
- " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- " trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
- "\n",
- "sim_item_topk = 20\n",
- "recall_item_num = 10\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
- " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- " \n",
- "user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'], open(save_path + 'embedding_sim_item_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
-    "    # Evaluate recall\n",
- " metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
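-    "### usercf recall\n",
-    "\n",
-    "User-based collaborative filtering recommends to a user the articles clicked by similar users. Because the similar users' click histories are involved, association rules can again be used to weight the articles the user may click. The rules here mainly weight the relation between the similar users' clicked articles and the target user's clicked articles. This relation can borrow directly from item-based collaborative filtering, except that here the relation weights are accumulated for each candidate article. The relation weights used and the corresponding code are below:\n",
-    "\n",
-    "1. Between the target user's history articles and a similar user's clicked article, sum the content similarity, the creation-time gap weight, and the relative-position weight; each sum serves as its own weight"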
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:32.293990Z",
- "start_time": "2020-11-17T02:09:32.278678Z"
- }
- },
- "outputs": [],
- "source": [
-    "# User-based recall (u2u2i)\n",
- "def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num, \n",
- " item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
- " \"\"\"\n",
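-    "        Recall based on user collaborative filtering\n",
-    "        :param user_id: user id\n",
-    "        :param user_item_time_dict: dict, each user's clicked-article sequence ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
-    "        :param u2u_sim: dict, user similarity matrix\n",
-    "        :param sim_user_topk: int, number of most similar users to take for the current user\n",
-    "        :param recall_item_num: int, number of articles to recall in the end\n",
-    "        :param item_topk_click: list, most-clicked articles, used to pad the recall list\n",
-    "        :param item_created_time_dict: dict, article creation times\n",
-    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",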
- " \n",
- " return: 召回的文章列表 [(item1, score1), (item2, score2)...]\n",
- " \"\"\"\n",
-    "    # Interaction history\n",
-    "    user_item_time_list = user_item_time_dict[user_id]    # [(item1, time1), (item2, time2)..]\n",
-    "    user_hist_items = set([i for i, t in user_item_time_list])   # a user may interact with the same article several times, so deduplicate\n",
- " \n",
- " items_rank = {}\n",
- " for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:\n",
- " for i, click_time in user_item_time_dict[sim_u]:\n",
- " if i in user_hist_items:\n",
- " continue\n",
- " items_rank.setdefault(i, 0)\n",
- " \n",
- " loc_weight = 1.0\n",
- " content_weight = 1.0\n",
- " created_time_weight = 1.0\n",
- " \n",
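-    "            # Accumulate weights between the candidate article and each article in the target user's history\n",
-    "            for loc, (j, click_time) in enumerate(user_item_time_list):\n",
-    "                # Relative-position weight at click time\n",
-    "                loc_weight += 0.9 ** (len(user_item_time_list) - loc)\n",
-    "                # Content-similarity weight\n",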
- " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
- " content_weight += emb_i2i_sim[i][j]\n",
- " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
- " content_weight += emb_i2i_sim[j][i]\n",
- " \n",
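-    "                # Creation-time gap weight, using the same decay form as item_based_recommend so that a larger gap gives a smaller weight\n",
-    "                created_time_weight += np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",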
- " \n",
- " items_rank[i] += loc_weight * content_weight * created_time_weight * wuv\n",
- " \n",
-    "    # Pad with popular items\n",
-    "    if len(items_rank) < recall_item_num:\n",
-    "        for i, item in enumerate(item_topk_click):\n",
-    "            if item in items_rank: # a padded item must not already be in the list\n",
-    "                continue\n",
-    "            items_rank[item] = - i - 100 # any negative score will do\n",
- " if len(items_rank) == recall_item_num:\n",
- " break\n",
- " \n",
- " items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num] \n",
- " \n",
- " return items_rank"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### usercf sim recall"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:05:41.652501Z",
- "start_time": "2020-11-16T07:05:40.953871Z"
- }
- },
- "outputs": [],
- "source": [
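-    "# The last click is held out for recall evaluation\n",
-    "# Computing user-to-user similarity in usercf is too memory-hungry, so the full data was not run here; a sampled subset was run instead\n",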
- "if metric_recall:\n",
- " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- " trn_hist_click_df = all_click_df\n",
- " \n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "\n",
- "u2u_sim = pickle.load(open(save_path + 'usercf_u2u_sim.pkl', 'rb'))\n",
- "\n",
- "sim_user_topk = 20\n",
- "recall_item_num = 10\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
- " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim) \n",
- "\n",
- "pickle.dump(user_recall_items_dict, open(save_path + 'usercf_u2u2i_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
-    "    # Evaluate recall\n",
- " metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T03:09:35.853516Z",
- "start_time": "2020-11-16T03:09:35.737625Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
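-    "#### user embedding sim recall\n",
-    "\n",
-    "Although user-to-user similarity was not computed directly with usercf, the user-based collaborative filtering code above can still be checked. Below, the user embeddings produced by YoutubeDNN are used in a vector search for each user's topk most similar users, and the resulting u2u similarity matrix is then fed to usercf for recall. The code is as follows."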
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:46.807811Z",
- "start_time": "2020-11-17T02:09:46.798033Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build the u2u similarity matrix from embeddings\n",
-    "# topk: the number of most similar users faiss returns for each user\n",
- "def u2u_embdding_sim(click_df, user_emb_dict, save_path, topk):\n",
- " \n",
- " user_list = []\n",
- " user_emb_list = []\n",
- " for user_id, user_emb in user_emb_dict.items():\n",
- " user_list.append(user_id)\n",
- " user_emb_list.append(user_emb)\n",
- " \n",
- " user_index_2_rawid_dict = {k: v for k, v in zip(range(len(user_list)), user_list)} \n",
- " \n",
- " user_emb_np = np.array(user_emb_list, dtype=np.float32)\n",
- " \n",
- " # 建立faiss索引\n",
- " user_index = faiss.IndexFlatIP(user_emb_np.shape[1])\n",
- " user_index.add(user_emb_np)\n",
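-    "    # Similarity search: for the vector at each index position, return the topk most similar users and their similarities\n",
-    "    sim, idx = user_index.search(user_emb_np, topk) # both are numpy arrays of shape (n_users, topk)\n",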
- " \n",
- " # 将向量检索的结果保存成原始id的对应关系\n",
- " user_sim_dict = collections.defaultdict(dict)\n",
- " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(user_emb_np)), sim, idx)):\n",
- " target_raw_id = user_index_2_rawid_dict[target_idx]\n",
-    "        # Start from index 1 to skip the user itself, so each user ends up with only topk-1 similar users\n",
- " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
- " rele_raw_id = user_index_2_rawid_dict[rele_idx]\n",
- " user_sim_dict[target_raw_id][rele_raw_id] = user_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
- " \n",
-    "    # Save the u2u similarity matrix\n",
- " pickle.dump(user_sim_dict, open(save_path + 'youtube_u2u_sim.pkl', 'wb')) \n",
- " return user_sim_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:14:31.355905Z",
- "start_time": "2020-11-17T02:09:53.236531Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "250000it [00:23, 10507.45it/s]\n"
- ]
- }
- ],
- "source": [
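-    "# Load the user embeddings produced by YoutubeDNN, then use faiss to compute user-to-user similarity\n",
-    "# Note that these user embeddings are not very good: YoutubeDNN trains them from user click sequences,\n",
-    "# so if most sequences are short, the embeddings tend to be of low quality\n",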
- "user_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
- "u2u_sim = u2u_embdding_sim(all_click_df, user_emb_dict, save_path, topk=10)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "通过YoutubeDNN得到的user_embedding"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:49:40.755431Z",
- "start_time": "2020-11-17T02:28:47.003514Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [19:43<00:00, 211.22it/s]\n"
- ]
- }
- ],
- "source": [
- "# 使用召回评估函数验证当前召回方式的效果\n",
- "if metric_recall:\n",
- " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
- "else:\n",
- " trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "u2u_sim = pickle.load(open(save_path + 'youtube_u2u_sim.pkl', 'rb'))\n",
- "\n",
- "sim_user_topk = 20\n",
- "recall_item_num = 10\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
- " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
- " \n",
- "user_multi_recall_dict['youtubednn_usercf_recall'] = user_recall_items_dict\n",
- "pickle.dump(user_multi_recall_dict['youtubednn_usercf_recall'], open(save_path + 'youtubednn_usercf_recall.pkl', 'wb'))\n",
- "\n",
- "if metric_recall:\n",
- " # 召回效果评估\n",
- " metrics_recall(user_multi_recall_dict['youtubednn_usercf_recall'], trn_last_click_df, topk=recall_item_num)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:07:44.326253Z",
- "start_time": "2020-11-16T07:07:43.798931Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 冷启动问题"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
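-    "**Cold-start problems fall into three categories: article cold start, user cold start, and system cold start.**\n",
-    "\n",
-    "- Article cold start: an article newly added to the platform has no interaction records, so the question is how to recommend it to users. (In our setting, any article that never appears in the log data can be treated as a cold-start article.)\n",
-    "- User cold start: a user new to the platform has no article interactions yet, so the question is how to recommend to that user. (In our setting, a test-set user who never appears in the corresponding test-set log data can be treated as a cold-start user. The criterion need not be that strict: we can also define our own indicators, such as usage time, click-through rate, or retention, to decide which users count as cold-start users.)\n",
-    "- System cold start: the platform has just launched and has no historical data at all. This is effectively a combination of the two cases above.\n",
-    "\n",
-    "**Analysis of cold start in the current scenario:**\n",
-    "\n",
-    "Analyzing the data shows that only a little over 30,000 distinct articles are ever clicked in the logs, while the article library contains more than 300,000. So might a test-set user's last click land on an article that never appears in the logs? If so, the clicked article has no prior interaction data, which is exactly the article cold-start case. The analysis also shows that a sizable share of test-set users have only a single click. Recommending with a model from a single click is hard, so user cold start is worth considering as well. Here we give solutions and code only for item (article) cold start, and only suggest some feasible approaches for user cold start.\n",
-    "\n",
-    "1. Article cold start (without the cold-start exploration problem)  \n",
-    "    We are not doing article cold start for its own sake. We suspect that users may click articles that never appear in the log data, so the task is to pick, from the nearly 270,000 such articles, a few cold-start candidates for each user. This can be viewed as one more recall strategy, and we use a simple, easy-to-understand rule-based recall to find unlogged articles a user is likely to click.\n",
-    "    The question becomes: how do we select a small subset of those 270,000 articles for each user? Random selection is one option. Some reference approaches follow.\n",
-    "    1. First, recall a batch of articles similar to the user's history based on embeddings\n",
-    "    2. Then filter the embedding-recalled articles with rules so that the remaining ones are more likely to be clicked. For example, keep articles whose topic matches the topics the user has clicked, or whose word count is close to that of the user's clicked articles. Also prefer articles created close to the time of the test-set user's last click, or on the same day.\n",
-    "2. User cold start  \n",
-    "    Analyzing the test-set click data shows that 20 percent of test-set users have only one click. Could recall for these low-activity users be supplemented with dedicated strategies, or could some articles be added by rule right after ranking? Both are worth trying; no concrete implementation is provided here.\n",
-    "    \n",
-    "**Note:**   \n",
-    "\n",
-    "This looks the same as computing item-item similarity from embeddings and then running itemcf, but the goal differs: here we want items with similar vectors that have not yet appeared in the logs, combined with the other cold-start rules. The number of recalled items therefore needs to be somewhat larger, otherwise nothing may be left after filtering."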
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T04:30:23.027164Z",
- "start_time": "2020-11-17T04:23:09.960235Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [05:01<00:00, 828.60it/s] \n"
- ]
- }
- ],
- "source": [
- "# 先进行itemcf召回,这里不需要做召回评估,这里只是一种策略\n",
- "trn_hist_click_df = all_click_df\n",
- "\n",
- "user_recall_items_dict = collections.defaultdict(dict)\n",
- "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
- "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
- "\n",
- "sim_item_topk = 150\n",
- "recall_item_num = 100 # 稍微召回多一点文章,便于后续的规则筛选\n",
- "\n",
- "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
- "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
- " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
- " recall_item_num, item_topk_click,item_created_time_dict, emb_i2i_sim)\n",
- "pickle.dump(user_recall_items_dict, open(save_path + 'cold_start_items_raw_dict.pkl', 'wb'))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:11:39.267581Z",
- "start_time": "2020-11-17T06:11:39.252563Z"
- }
- },
- "outputs": [],
- "source": [
- "# 基于规则进行文章过滤\n",
- "# 保留文章主题与用户历史浏览主题相似的文章\n",
- "# 保留文章字数与用户历史浏览文章字数相差不大的文章\n",
- "# 保留最后一次点击当天的文章\n",
- "# 按照相似度返回最终的结果\n",
- "\n",
- "def get_click_article_ids_set(all_click_df):\n",
- " return set(all_click_df.click_article_id.values)\n",
- "\n",
- "def cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
- " user_last_item_created_time_dict, item_type_dict, item_words_dict, \n",
- " item_created_time_dict, click_article_ids_set, recall_item_num):\n",
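-    "    \"\"\"\n",
-    "        Recall some articles under the cold-start setting\n",
-    "        :param user_recall_items_dict: articles recalled by content-embedding similarity, dict, {user1: [(item1, score1), (item2, score2), ...], ...}\n",
-    "        :param user_hist_item_typs_dict: dict, user -> set of topics of the articles the user clicked\n",
-    "        :param user_hist_item_words_dict: dict, user -> mean word count of the articles the user clicked\n",
-    "        :param user_last_item_created_time_dict: dict, user -> creation time of the article the user clicked last\n",
-    "        :param item_type_dict: dict, article -> topic\n",
-    "        :param item_words_dict: dict, article -> word count\n",
-    "        :param item_created_time_dict: dict, article -> creation time\n",
-    "        :param click_article_ids_set: set of articles clicked by any user, i.e. articles that appear in the logs\n",
-    "        :param recall_item_num: number of articles to recall; this counts only articles that do not appear in the logs\n",
-    "    \"\"\"\n",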
- " \n",
- " cold_start_user_items_dict = {}\n",
- " for user, item_list in tqdm(user_recall_items_dict.items()):\n",
- " cold_start_user_items_dict.setdefault(user, [])\n",
- " for item, score in item_list:\n",
- " # 获取历史文章信息\n",
- " hist_item_type_set = user_hist_item_typs_dict[user]\n",
- " hist_mean_words = user_hist_item_words_dict[user]\n",
- " hist_last_item_created_time = user_last_item_created_time_dict[user]\n",
- " hist_last_item_created_time = datetime.fromtimestamp(hist_last_item_created_time)\n",
- " \n",
- " # 获取当前召回文章的信息\n",
- " curr_item_type = item_type_dict[item]\n",
- " curr_item_words = item_words_dict[item]\n",
- " curr_item_created_time = item_created_time_dict[item]\n",
- " curr_item_created_time = datetime.fromtimestamp(curr_item_created_time)\n",
- "\n",
-    "            # First, the article must not appear in the click logs at all; then filter by topic, word count, and creation time\n",
- " if curr_item_type not in hist_item_type_set or \\\n",
- " item in click_article_ids_set or \\\n",
- " abs(curr_item_words - hist_mean_words) > 200 or \\\n",
- " abs((curr_item_created_time - hist_last_item_created_time).days) > 90: \n",
- " continue\n",
- " \n",
- " cold_start_user_items_dict[user].append((item, score)) # {user1: [(item1, score1), (item2, score2)..]...}\n",
- " \n",
- " # 需要控制一下冷启动召回的数量\n",
- " cold_start_user_items_dict = {k: sorted(v, key=lambda x:x[1], reverse=True)[:recall_item_num] \\\n",
- " for k, v in cold_start_user_items_dict.items()}\n",
- " \n",
- " pickle.dump(cold_start_user_items_dict, open(save_path + 'cold_start_user_items_dict.pkl', 'wb'))\n",
- " \n",
- " return cold_start_user_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:35:38.758278Z",
- "start_time": "2020-11-17T06:31:40.164332Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [01:49<00:00, 2293.37it/s]\n"
- ]
- }
- ],
- "source": [
- "all_click_df_ = all_click_df.copy()\n",
- "all_click_df_ = all_click_df_.merge(item_info_df, how='left', on='click_article_id')\n",
- "user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict = get_user_hist_item_info_dict(all_click_df_)\n",
- "click_article_ids_set = get_click_article_ids_set(all_click_df)\n",
- "# 需要注意的是\n",
- "# 这里使用了很多规则来筛选冷启动的文章,所以前面再召回的阶段就应该尽可能的多召回一些文章,否则很容易被删掉\n",
- "cold_start_user_items_dict = cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
- " user_last_item_created_time_dict, item_type_dict, item_words_dict, \\\n",
- " item_created_time_dict, click_article_ids_set, recall_item_num)\n",
- "\n",
- "user_multi_recall_dict['cold_start_recall'] = cold_start_user_items_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-16T07:13:33.099298Z",
- "start_time": "2020-11-16T07:13:32.655036Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
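-    "## Merging the recall channels\n",
-    "Merging the recall channels means combining the per-user article lists produced by all the recall strategies above. The channels are:\n",
-    "1. Recall based on the item-item similarity computed by itemcf \n",
-    "2. Recall based on the item-item similarity from embedding search\n",
-    "3. YoutubeDNN recall\n",
-    "4. Recall based on the user-user similarity derived from YoutubeDNN\n",
-    "5. Recall based on the cold-start strategy\n",
-    "\n",
-    "**Note:**   \n",
-    "Recall evaluation shows that some channels perform well and others poorly, so we can manually assign a weight to each channel's results when fusing the final similarity scores."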
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:02:16.033971Z",
- "start_time": "2020-11-17T07:02:16.019819Z"
- }
- },
- "outputs": [],
- "source": [
- "def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):\n",
- " final_recall_items_dict = {}\n",
- " \n",
- " # 对每一种召回结果按照用户进行归一化,方便后面多种召回结果,相同用户的物品之间权重相加\n",
- " def norm_user_recall_items_sim(sorted_item_list):\n",
- " # 如果冷启动中没有文章或者只有一篇文章,直接返回,出现这种情况的原因可能是冷启动召回的文章数量太少了,\n",
- " # 基于规则筛选之后就没有文章了, 这里还可以做一些其他的策略性的筛选\n",
- " if len(sorted_item_list) < 2:\n",
- " return sorted_item_list\n",
- " \n",
- " min_sim = sorted_item_list[-1][1]\n",
- " max_sim = sorted_item_list[0][1]\n",
- " \n",
- " norm_sorted_item_list = []\n",
- " for item, score in sorted_item_list:\n",
- " if max_sim > 0:\n",
- " norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0\n",
- " else:\n",
- " norm_score = 0.0\n",
- " norm_sorted_item_list.append((item, norm_score))\n",
- " \n",
- " return norm_sorted_item_list\n",
- " \n",
- " print('多路召回合并...')\n",
- " for method, user_recall_items in tqdm(user_multi_recall_dict.items()):\n",
- " print(method + '...')\n",
- " # 在计算最终召回结果的时候,也可以为每一种召回结果设置一个权重\n",
-    "        if weight_dict is None:\n",
- " recall_method_weight = 1\n",
- " else:\n",
- " recall_method_weight = weight_dict[method]\n",
- " \n",
- " for user_id, sorted_item_list in user_recall_items.items(): # 进行归一化\n",
- " user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list)\n",
- " \n",
- " for user_id, sorted_item_list in user_recall_items.items():\n",
- " # print('user_id')\n",
- " final_recall_items_dict.setdefault(user_id, {})\n",
- " for item, score in sorted_item_list:\n",
- " final_recall_items_dict[user_id].setdefault(item, 0)\n",
- " final_recall_items_dict[user_id][item] += recall_method_weight * score \n",
- " \n",
- " final_recall_items_dict_rank = {}\n",
- " # 多路召回时也可以控制最终的召回数量\n",
- " for user, recall_item_dict in final_recall_items_dict.items():\n",
- " final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk]\n",
- "\n",
- " # 将多路召回后的最终结果字典保存到本地\n",
- " pickle.dump(final_recall_items_dict_rank, open(os.path.join(save_path, 'final_recall_items_dict.pkl'),'wb'))\n",
- "\n",
- " return final_recall_items_dict_rank"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:02:21.078455Z",
- "start_time": "2020-11-17T07:02:21.074060Z"
- }
- },
- "outputs": [],
- "source": [
- "# 这里直接对多路召回的权重给了一个相同的值,其实可以根据前面召回的情况来调整参数的值\n",
- "weight_dict = {'itemcf_sim_itemcf_recall': 1.0,\n",
- " 'embedding_sim_item_recall': 1.0,\n",
- " 'youtubednn_recall': 1.0,\n",
- " 'youtubednn_usercf_recall': 1.0, \n",
- " 'cold_start_recall': 1.0}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:04:35.747924Z",
- "start_time": "2020-11-17T07:02:26.889573Z"
- }
- },
- "outputs": [
+ "cells": [
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 0%| | 0/5 [00:00, ?it/s]"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
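+    "# Multi-Channel Recall\n",
+    "\n",
+    "Multi-channel recall uses several different strategies, features, or simple models to each recall part of the candidate set, then merges those candidates for the downstream ranking model. It is a trade-off between computation speed and recall rate: the simple strategies keep candidate retrieval fast, while designing them from different angles keeps the overall recall rate close to ideal so that ranking quality does not suffer. The figure below sketches multi-channel recall. The channels are independent of one another, so they can generally be run concurrently in multiple threads for efficiency.\n",
+    "\n",
+    "![image-20201119132726873](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119132726873.png)    \n",
+    "\n",
+    "The figure above is only one example: many different strategies can be used to obtain the candidate item set for ranking, and which ones to use depends heavily on the business. Each task has recall rules that reflect its real-world scenario. For news recommendation, for example, the rules could be popular-news recall, author recall, keyword recall, topic recall, collaborative-filtering recall, and so on.   \n",
+    "\n"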
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "多路召回合并...\n",
- "itemcf_sim_itemcf_recall...\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Imports"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 20%|██ | 1/5 [00:08<00:34, 8.66s/it]"
- ]
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:29.834662Z",
+ "start_time": "2020-11-16T11:26:27.811511Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd \n",
+ "import numpy as np\n",
+ "from tqdm import tqdm \n",
+ "from collections import defaultdict \n",
+    "import os, math, warnings, pickle\n",
+ "from tqdm import tqdm\n",
+ "import faiss\n",
+ "import collections\n",
+ "import random\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from datetime import datetime\n",
+ "from deepctr.feature_column import SparseFeat, VarLenSparseFeat\n",
+ "from sklearn.preprocessing import LabelEncoder\n",
+ "from tensorflow.python.keras import backend as K\n",
+ "from tensorflow.python.keras.models import Model\n",
+ "from tensorflow.python.keras.preprocessing.sequence import pad_sequences\n",
+ "\n",
+ "from deepmatch.models import *\n",
+ "from deepmatch.utils import sampledsoftmaxloss\n",
+ "warnings.filterwarnings('ignore')"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "embedding_sim_item_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:31.831215Z",
+ "start_time": "2020-11-16T11:26:31.826939Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'\n",
+    "# Flag for recall evaluation; when False, recall runs directly on the full data\n",
+ "metric_recall = False"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 40%|████ | 2/5 [00:16<00:24, 8.29s/it]"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
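+    "## Reading the data\n",
+    "Data loading in a typical recommender-system competition falls into three modes, each using a different dataset:\n",
+    "1. Debug mode: the goal is to build a simple baseline on the data and get it running end to end, confirming the baseline code has no problems. Recommendation competition data is usually very large, and analyzing it and building the baseline on all of it from the start costs a lot of time and hardware, **so we usually draw a random sample from the large training set for debugging (train_click_log_sample)** and get a baseline running first.\n",
+    "2. Offline validation mode: the goal is to choose a suitable model and hyperparameters offline using the existing training data. **Here we only need to load the whole training set (train_click_log)** and split it again into a training part and a validation part. The training part fits the model; the validation part is used to tune model parameters and other hyperparameters.\n",
+    "3. Online mode: after building a baseline in debug mode and choosing the model and hyperparameters in offline validation mode, this step makes the actual predictions for the given test set and submits them online, **so the training data used here is the full dataset (train_click_log+test_click_log)**\n",
+    "\n",
+    "The cells below define a separate data-loading function for each of these modes so that the right data can be loaded later."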
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "youtubednn_recall...\n",
- "youtubednn_usercf_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:34.476240Z",
+ "start_time": "2020-11-16T11:26:34.467352Z"
+ }
+ },
+ "outputs": [],
+ "source": [
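+    "# Debug mode: sample part of the training set to debug the code\n",
+    "def get_all_click_sample(data_path, sample_nums=10000):\n",
+    "    \"\"\"\n",
+    "        Sample part of the training set for debugging\n",
+    "        data_path: path where the raw data is stored\n",
+    "        sample_nums: number of users to sample (sampling is done by user because of machine memory limits)\n",
+    "    \"\"\"\n",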
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " all_user_ids = all_click.user_id.unique()\n",
+ "\n",
+ " sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False) \n",
+ " all_click = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click\n",
+ "\n",
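+    "# Read the click data in online or offline mode. To produce an online submission, the test-set clicks should be merged into the full data\n",
+    "# To validate the model or features offline, the training set alone is enough\n",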
+ "def get_all_click_df(data_path='./data_raw/', offline=True):\n",
+ " if offline:\n",
+ " all_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " else:\n",
+ " trn_click = pd.read_csv(data_path + 'train_click_log.csv')\n",
+ " tst_click = pd.read_csv(data_path + 'testA_click_log.csv')\n",
+ "\n",
+    "        all_click = pd.concat([trn_click, tst_click])  # DataFrame.append was removed in pandas 2.0\n",
+ " \n",
+ " all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))\n",
+ " return all_click"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- " 80%|████████ | 4/5 [00:23<00:06, 6.98s/it]"
- ]
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:35.168738Z",
+ "start_time": "2020-11-16T11:26:35.163210Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Read the basic attributes of the articles\n",
+ "def get_item_info_df(data_path):\n",
+ " item_info_df = pd.read_csv(data_path + 'articles.csv')\n",
+ " \n",
+    "    # Rename article_id to click_article_id so it can be joined with click_article_id in the training set\n",
+ " item_info_df = item_info_df.rename(columns={'article_id': 'click_article_id'})\n",
+ " \n",
+ " return item_info_df"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "cold_start_recall...\n"
- ]
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:36.152958Z",
+ "start_time": "2020-11-16T11:26:36.146324Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Read the article embedding data\n",
+ "def get_item_emb_dict(data_path):\n",
+ " item_emb_df = pd.read_csv(data_path + 'articles_emb.csv')\n",
+ " \n",
+ " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
+ " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols])\n",
+    "    # L2-normalize each embedding row\n",
+ " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
+ "\n",
+ " item_emb_dict = dict(zip(item_emb_df['article_id'], item_emb_np))\n",
+ " pickle.dump(item_emb_dict, open(save_path + 'item_content_emb.pkl', 'wb'))\n",
+ " \n",
+ " return item_emb_dict"
+ ]
},
{
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 5/5 [00:42<00:00, 8.40s/it]\n"
- ]
- }
- ],
- "source": [
- "# 最终合并之后每个用户召回150个商品进行排序\n",
- "final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=150)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 总结\n",
- "\n",
- "上述实现了如下召回策略:\n",
- "\n",
- "1. 基于关联规则的itemcf\n",
- "2. 基于关联规则的usercf\n",
- "3. youtubednn召回\n",
- "4. 冷启动召回\n",
- "\n",
- "对于上述实现的召回策略其实都不是最优的结果,我们只是做了个简单的尝试,其中还有很多地方可以优化,包括已经实现的这些召回策略的参数或者新加一些,修改一些关联规则都可以。当然还可以尝试更多的召回策略,比如对新闻进行热度召回等等。\n",
- "\n",
- "\n",
- "\n",
- "**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "nbTranslate": {
- "displayLangs": [
- "*"
- ],
- "hotkey": "alt-t",
- "langInMainMenu": true,
- "sourceLang": "en",
- "targetLang": "fr",
- "useGoogleTranslate": true
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:37.333536Z",
+ "start_time": "2020-11-16T11:26:37.329545Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:42.163494Z",
+ "start_time": "2020-11-16T11:26:38.018094Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Sampled data\n",
+    "# all_click_df = get_all_click_sample(data_path)\n",
+    "\n",
+    "# Full data: training-set clicks plus test-set clicks (offline=False)\n",
+ "all_click_df = get_all_click_df(offline=False)\n",
+ "\n",
+    "# Min-max normalize the timestamps; they are used to compute weights in the association rules\n",
+ "all_click_df['click_timestamp'] = all_click_df[['click_timestamp']].apply(max_min_scaler)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:26:44.343500Z",
+ "start_time": "2020-11-16T11:26:44.113891Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_info_df = get_item_info_df(data_path)"
+ ]
+ },
{
- "id": "83580",
- "title": "零基础入门推荐系统 - 新闻推荐"
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:24.295343Z",
+ "start_time": "2020-11-16T11:26:44.398007Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "item_emb_dict = get_item_emb_dict(data_path)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Utility functions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### User-article-time helper\n",
+    "Used by the association-rule-based user collaborative filtering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:33.791656Z",
+ "start_time": "2020-11-16T11:27:33.784305Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Build each user's click sequence ordered by click time: {user1: [(item1, time1), (item2, time2)..]...}\n",
+ "def get_user_item_time(click_df):\n",
+ " \n",
+ " click_df = click_df.sort_values('click_timestamp')\n",
+ " \n",
+ " def make_item_time_pair(df):\n",
+ " return list(zip(df['click_article_id'], df['click_timestamp']))\n",
+ " \n",
+    "    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(lambda x: make_item_time_pair(x))\\\n",
+ " .reset_index().rename(columns={0: 'item_time_list'})\n",
+ " user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))\n",
+ " \n",
+ " return user_item_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Article-user-time helper\n",
+    "Used by the association-rule-based article collaborative filtering"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:38.327581Z",
+ "start_time": "2020-11-16T11:27:38.321059Z"
+ }
+ },
+ "outputs": [],
+ "source": [
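+    "# Build, for each article, the sequence of users who clicked it, ordered by click time: {item1: [(user1, time1), (user2, time2)...]...}\n",
+    "# The time here is the moment the user clicked the article; it does not seem to be directly relevant.\n",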
+ "def get_item_user_time_dict(click_df):\n",
+ " def make_user_time_pair(df):\n",
+ " return list(zip(df['user_id'], df['click_timestamp']))\n",
+ " \n",
+ " click_df = click_df.sort_values('click_timestamp')\n",
+    "    item_user_time_df = click_df.groupby('click_article_id')[['user_id', 'click_timestamp']].apply(lambda x: make_user_time_pair(x))\\\n",
+ " .reset_index().rename(columns={0: 'user_time_list'})\n",
+ " \n",
+ " item_user_time_dict = dict(zip(item_user_time_df['click_article_id'], item_user_time_df['user_time_list']))\n",
+ " return item_user_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
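+    "### Historical clicks and last click\n",
+    "Used when evaluating recall results, in feature engineering, and when building labels to turn the task into a supervised-learning dataset"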
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:50.894683Z",
+ "start_time": "2020-11-16T11:27:50.888002Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Split the data into each user's historical clicks and last click\n",
+ "def get_hist_and_last_click(all_click):\n",
+ " \n",
+ " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
+ " click_last_df = all_click.groupby('user_id').tail(1)\n",
+ "\n",
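+    "    # If a user has only one click, the history would be empty and the user would be invisible during training, so in that case the single click is deliberately kept (leaked) in the history\n",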
+ " def hist_func(user_df):\n",
+ " if len(user_df) == 1:\n",
+ " return user_df\n",
+ " else:\n",
+ " return user_df[:-1]\n",
+ "\n",
+ " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
+ "\n",
+ " return click_hist_df, click_last_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Article attribute features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:55.893810Z",
+ "start_time": "2020-11-16T11:27:55.887623Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Map each article id to its basic attributes and store them as dicts for direct use in the recall and cold-start stages\n",
+ "def get_item_info_dict(item_info_df):\n",
+ " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
+ " item_info_df['created_at_ts'] = item_info_df[['created_at_ts']].apply(max_min_scaler)\n",
+ " \n",
+ " item_type_dict = dict(zip(item_info_df['click_article_id'], item_info_df['category_id']))\n",
+ " item_words_dict = dict(zip(item_info_df['click_article_id'], item_info_df['words_count']))\n",
+ " item_created_time_dict = dict(zip(item_info_df['click_article_id'], item_info_df['created_at_ts']))\n",
+ " \n",
+ " return item_type_dict, item_words_dict, item_created_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-13T06:42:38.730939Z",
+ "start_time": "2020-11-13T06:42:38.728461Z"
+ }
+ },
+ "source": [
+    "### Information about each user's historically clicked articles"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:27:59.650781Z",
+ "start_time": "2020-11-16T11:27:59.640572Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_user_hist_item_info_dict(all_click):\n",
+ " \n",
+    "    # Dict: user_id -> set of categories of the articles the user has clicked\n",
+ " user_hist_item_typs = all_click.groupby('user_id')['category_id'].agg(set).reset_index()\n",
+ " user_hist_item_typs_dict = dict(zip(user_hist_item_typs['user_id'], user_hist_item_typs['category_id']))\n",
+ " \n",
+    "    # Dict: user_id -> set of articles the user has clicked\n",
+ " user_hist_item_ids_dict = all_click.groupby('user_id')['click_article_id'].agg(set).reset_index()\n",
+ " user_hist_item_ids_dict = dict(zip(user_hist_item_ids_dict['user_id'], user_hist_item_ids_dict['click_article_id']))\n",
+ " \n",
+    "    # Dict: user_id -> mean word count of the articles the user has clicked\n",
+ " user_hist_item_words = all_click.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
+ " user_hist_item_words_dict = dict(zip(user_hist_item_words['user_id'], user_hist_item_words['words_count']))\n",
+ " \n",
+    "    # Dict: user_id -> creation time of the article the user clicked last\n",
+ " all_click_ = all_click.sort_values('click_timestamp')\n",
+ " user_last_item_created_time = all_click_.groupby('user_id')['created_at_ts'].apply(lambda x: x.iloc[-1]).reset_index()\n",
+ " \n",
+ " max_min_scaler = lambda x : (x-np.min(x))/(np.max(x)-np.min(x))\n",
+ " user_last_item_created_time['created_at_ts'] = user_last_item_created_time[['created_at_ts']].apply(max_min_scaler)\n",
+ " \n",
+ " user_last_item_created_time_dict = dict(zip(user_last_item_created_time['user_id'], \\\n",
+ " user_last_item_created_time['created_at_ts']))\n",
+ " \n",
+ " return user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### The topk most-clicked articles"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:04.761105Z",
+ "start_time": "2020-11-16T11:28:04.756419Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Get the k most-clicked articles in the given click data\n",
+ "def get_item_topk_click(click_df, k):\n",
+ " topk_click = click_df['click_article_id'].value_counts().index[:k]\n",
+ " return topk_click"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Defining the multi-channel recall dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:08.321506Z",
+ "start_time": "2020-11-16T11:28:07.623281Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Load the article attributes and store them as dicts for easy lookup\n",
+ "item_type_dict, item_words_dict, item_created_time_dict = get_item_info_dict(item_info_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:28:13.791569Z",
+ "start_time": "2020-11-16T11:28:13.786522Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define a dict that holds the results of every recall channel\n",
+ "user_multi_recall_dict = {'itemcf_sim_itemcf_recall': {},\n",
+ " 'embedding_sim_item_recall': {},\n",
+ " 'youtubednn_recall': {},\n",
+ " 'youtubednn_usercf_recall': {}, \n",
+ " 'cold_start_recall': {}}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T05:41:12.710754Z",
+ "start_time": "2020-11-16T05:40:57.842614Z"
+ }
+ },
+ "outputs": [],
+ "source": [
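+    "# For recall evaluation (offline validation), hold out each user's last click as the evaluation target\n",
+    "# When not evaluating, recall directly on the full data without holding out the last click\n",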
+ "trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
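+    "## Recall evaluation function\n",
+    "Recall results set the upper bound on what the final ranking can achieve, so the recall method and its parameters often need tuning after a first run. The function below evaluates a recall result by its hit rate."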
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T05:41:18.579118Z",
+ "start_time": "2020-11-16T05:41:18.571887Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Report the hit rate within the top 10, 20, ..., topk recalled articles\n",
+    "def metrics_recall(user_recall_items_dict, trn_last_click_df, topk=50):\n",
+ " last_click_item_dict = dict(zip(trn_last_click_df['user_id'], trn_last_click_df['click_article_id']))\n",
+ " user_num = len(user_recall_items_dict)\n",
+ " \n",
+ " for k in range(10, topk+1, 10):\n",
+ " hit_num = 0\n",
+ " for user, item_list in user_recall_items_dict.items():\n",
+    "            # Take the top-k recalled items\n",
+ " tmp_recall_items = [x[0] for x in user_recall_items_dict[user][:k]]\n",
+ " if last_click_item_dict[user] in set(tmp_recall_items):\n",
+ " hit_num += 1\n",
+ " \n",
+ " hit_rate = round(hit_num * 1.0 / user_num, 5)\n",
+ " print(' topk: ', k, ' : ', 'hit_num: ', hit_num, 'hit_rate: ', hit_rate, 'user_num : ', user_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
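+    "## Compute similarity matrices\n",
+    "\n",
+    "This part builds similarity matrices with collaborative filtering and vector retrieval. They come in two kinds, user2user and item2item. The sections below compute the itemcf item2item matrix, the usercf user2user matrix, and the embedding-based item2item matrix in turn."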
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### itemcf i2i_sim\n",
+ "\n",
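+    "Borrowing from the KDD 2020 debiased product recommendation task, the item2item similarity is computed with association rules so that it also accounts for:\n",
+    "1. a weight for the time between the user's clicks\n",
+    "2. a weight for the order of the user's clicks\n",
+    "3. a weight for the article creation times"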
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:30:51.872262Z",
+ "start_time": "2020-11-16T11:30:51.860099Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def itemcf_sim(df, item_created_time_dict):\n",
+ " \"\"\"\n",
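+    "        Compute the article-to-article similarity matrix\n",
+    "        :param df: click log DataFrame\n",
+    "        :param item_created_time_dict: dict mapping article id to its creation time\n",
+    "        return : article-to-article similarity matrix\n",
+    "        \n",
+    "        Idea: item-based collaborative filtering (see the previous team-study course on recommender system basics) + association rules\n",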
+ " \"\"\"\n",
+ " \n",
+ " user_item_time_dict = get_user_item_time(df)\n",
+ " \n",
+    "    # Compute item-to-item similarity\n",
+ " i2i_sim = {}\n",
+ " item_cnt = defaultdict(int)\n",
+ " for user, item_time_list in tqdm(user_item_time_dict.items()):\n",
+    "        # Time factors can be taken into account when refining item-based collaborative filtering\n",
+ " for loc1, (i, i_click_time) in enumerate(item_time_list):\n",
+ " item_cnt[i] += 1\n",
+ " i2i_sim.setdefault(i, {})\n",
+ " for loc2, (j, j_click_time) in enumerate(item_time_list):\n",
+ " if(i == j):\n",
+ " continue\n",
+ " \n",
+    "                # Weight forward-order click pairs higher than reverse-order ones\n",
+ " loc_alpha = 1.0 if loc2 > loc1 else 0.7\n",
+    "                # Position weight; its parameters are tunable\n",
+ " loc_weight = loc_alpha * (0.9 ** (np.abs(loc2 - loc1) - 1))\n",
+    "                # Click-time weight; its parameters are tunable\n",
+ " click_time_weight = np.exp(0.7 ** np.abs(i_click_time - j_click_time))\n",
+    "                # Weight for the gap between the two articles' creation times; its parameters are tunable\n",
+ " created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
+ " i2i_sim[i].setdefault(j, 0)\n",
+    "                # Combine all the weights into the final article-to-article similarity\n",
+ " i2i_sim[i][j] += loc_weight * click_time_weight * created_time_weight / math.log(len(item_time_list) + 1)\n",
+ " \n",
+ " i2i_sim_ = i2i_sim.copy()\n",
+ " for i, related_items in i2i_sim.items():\n",
+ " for j, wij in related_items.items():\n",
+ " i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])\n",
+ " \n",
+    "    # Save the similarity matrix to disk\n",
+ " pickle.dump(i2i_sim_, open(save_path + 'itemcf_i2i_sim.pkl', 'wb'))\n",
+ " \n",
+ " return i2i_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:47:09.937002Z",
+ "start_time": "2020-11-16T11:30:57.394334Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [14:20<00:00, 290.38it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "i2i_sim = itemcf_sim(all_click_df, item_created_time_dict)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### usercf u2u_sim\n",
+ "\n",
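+    "Simple association rules can also be applied when computing user-to-user similarity, for example a user-activity weight. Here a user's click count serves as the activity measure."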
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:14.951940Z",
+ "start_time": "2020-11-16T09:11:14.945654Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_user_activate_degree_dict(all_click_df):\n",
+ " all_click_df_ = all_click_df.groupby('user_id')['click_article_id'].count().reset_index()\n",
+ " \n",
+    "    # Min-max normalize the user activity\n",
+ " mm = MinMaxScaler()\n",
+ " all_click_df_['click_article_id'] = mm.fit_transform(all_click_df_[['click_article_id']])\n",
+ " user_activate_degree_dict = dict(zip(all_click_df_['user_id'], all_click_df_['click_article_id']))\n",
+ " \n",
+ " return user_activate_degree_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:19.879276Z",
+ "start_time": "2020-11-16T09:11:19.868808Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def usercf_sim(all_click_df, user_activate_degree_dict):\n",
+ " \"\"\"\n",
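+    "        Compute the user-to-user similarity matrix\n",
+    "        :param all_click_df: click log DataFrame\n",
+    "        :param user_activate_degree_dict: dict mapping user id to normalized activity\n",
+    "        return user-to-user similarity matrix\n",
+    "        \n",
+    "        Idea: user-based collaborative filtering (see the previous team-study course on recommender system basics) + association rules\n",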
+ " \"\"\"\n",
+ " item_user_time_dict = get_item_user_time_dict(all_click_df)\n",
+ " \n",
+ " u2u_sim = {}\n",
+ " user_cnt = defaultdict(int)\n",
+ " for item, user_time_list in tqdm(item_user_time_dict.items()):\n",
+ " for u, click_time in user_time_list:\n",
+ " user_cnt[u] += 1\n",
+ " u2u_sim.setdefault(u, {})\n",
+ " for v, click_time in user_time_list:\n",
+ " u2u_sim[u].setdefault(v, 0)\n",
+ " if u == v:\n",
+ " continue\n",
+    "                # Use the two users' average activity as the weight; this formula can be refined further\n",
+ " activate_weight = 100 * 0.5 * (user_activate_degree_dict[u] + user_activate_degree_dict[v]) \n",
+ " u2u_sim[u][v] += activate_weight / math.log(len(user_time_list) + 1)\n",
+ " \n",
+ " u2u_sim_ = u2u_sim.copy()\n",
+ " for u, related_users in u2u_sim.items():\n",
+ " for v, wij in related_users.items():\n",
+ " u2u_sim_[u][v] = wij / math.sqrt(user_cnt[u] * user_cnt[v])\n",
+ " \n",
+    "    # Save the similarity matrix to disk\n",
+ " pickle.dump(u2u_sim_, open(save_path + 'usercf_u2u_sim.pkl', 'wb'))\n",
+ "\n",
+ " return u2u_sim_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T06:59:46.701572Z",
+ "start_time": "2020-11-16T06:59:26.852246Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# usercf similarity needs too much memory on the full data, so this cell was not run as-is\n",
+    "# It does run on a sampled subset\n",
+ "user_activate_degree_dict = get_user_activate_degree_dict(all_click_df)\n",
+ "u2u_sim = usercf_sim(all_click_df, user_activate_degree_dict)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### item embedding sim\n",
+ "\n",
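+    "Item similarity is computed from embeddings so that the later cold-start step can retrieve articles that never appear in the click data. Cold start has its own section later; here is a brief introduction to faiss.\n",
+    "\n",
+    "Faiss is a library for clustering and similarity search, open-sourced by Facebook's AI team and implemented in C++ underneath. Because of its excellent performance, it is widely used in recommendation-related services.\n",
+    "\n",
+    "In recommender systems faiss is typically used for vector recall, which is u2u, u2i, or i2i, where u and i stand for user and item. In real settings the numbers of users and items are huge. The most obvious vector-similarity recall is a double loop over the user or item list that computes the similarity of every pair of vectors, but that is impractical at scale. Faiss speeds up finding the topk indexed vectors most similar to a query vector.\n",
+    "\n",
+    "**How faiss search works:**\n",
+    "\n",
+    "Faiss uses PCA and PQ (product quantization) to compress and encode vectors. It applies other optimizations as well, but PCA and PQ are the core.\n",
+    "\n",
+    "1. For the details of PCA dimensionality reduction, see:  \n",
+    "[Summary of the principles of principal component analysis (PCA)](https://www.cnblogs.com/pinard/p/6239403.html)  \n",
+    "\n",
+    "2. For the details of PQ encoding, see:  \n",
+    "[Understanding the product quantization algorithm by example](http://www.fabwrite.com/productquantization)\n",
+    "\n",
+    "**Using faiss**\n",
+    "\n",
+    "[Official faiss tutorial](https://github.com/facebookresearch/faiss/wiki/Getting-started)"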
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:11:28.631803Z",
+ "start_time": "2020-11-16T09:11:28.619926Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Similarity computation by vector retrieval\n",
+    "# topk is the number of most similar items faiss returns for each item\n",
+ "def embdding_sim(click_df, item_emb_df, save_path, topk):\n",
+ " \"\"\"\n",
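+    "        Compute the content-based article embedding similarity matrix\n",
+    "        :param click_df: click log DataFrame\n",
+    "        :param item_emb_df: article embeddings\n",
+    "        :param save_path: directory to save the result in\n",
+    "        :param topk: number of most similar articles to retrieve\n",
+    "        return article similarity matrix\n",
+    "        \n",
+    "        Idea: for each article, return the topk most similar articles by embedding similarity; faiss is used for speed because there are so many articles\n",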
+ " \"\"\"\n",
+ " \n",
+    "    # Map each row index to the raw article id\n",
+ " item_idx_2_rawid_dict = dict(zip(item_emb_df.index, item_emb_df['article_id']))\n",
+ " \n",
+ " item_emb_cols = [x for x in item_emb_df.columns if 'emb' in x]\n",
+ " item_emb_np = np.ascontiguousarray(item_emb_df[item_emb_cols].values, dtype=np.float32)\n",
+    "    # Normalize each vector to unit length\n",
+ " item_emb_np = item_emb_np / np.linalg.norm(item_emb_np, axis=1, keepdims=True)\n",
+ " \n",
+    "    # Build the faiss index\n",
+ " item_index = faiss.IndexFlatIP(item_emb_np.shape[1])\n",
+ " item_index.add(item_emb_np)\n",
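+    "    # Similarity search: for the vector at each index position, return the topk items and their similarities\n",
+    "    sim, idx = item_index.search(item_emb_np, topk) # returns two arrays: similarities and row indices\n",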
+ " \n",
+    "    # Store the retrieval results keyed by raw article id\n",
+ " item_sim_dict = collections.defaultdict(dict)\n",
+ " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(item_emb_np)), sim, idx)):\n",
+ " target_raw_id = item_idx_2_rawid_dict[target_idx]\n",
+    "        # Start from position 1 to drop the item itself, so each item ends up with topk-1 similar items\n",
+ " for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]): \n",
+ " rele_raw_id = item_idx_2_rawid_dict[rele_idx]\n",
+ " item_sim_dict[target_raw_id][rele_raw_id] = item_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
+ " \n",
+    "    # Save the i2i similarity matrix\n",
+ " pickle.dump(item_sim_dict, open(save_path + 'emb_i2i_sim.pkl', 'wb')) \n",
+ " \n",
+ " return item_sim_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T09:32:35.926116Z",
+ "start_time": "2020-11-16T09:11:44.586967Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "364047it [00:23, 15292.14it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "item_emb_df = pd.read_csv(data_path + '/articles_emb.csv')\n",
+    "emb_i2i_sim = embdding_sim(all_click_df, item_emb_df, save_path, topk=10) # topk can be set as needed"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
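+    "## Recall\n",
+    "This is the question raised at the start: with about 360,000 articles and more than 200,000 users, how can the problem be scaled down? The recall stage answers it by selecting, for each user, a candidate set of articles the user may click. Common recall strategies:\n",
+    "* Youtube DNN recall\n",
+    "* Article-based recall\n",
+    "    * Article collaborative filtering\n",
+    "    * Recall based on article embeddings\n",
+    "* User-based recall\n",
+    "    * User collaborative filtering\n",
+    "    * User embeddings\n",
+    "\n",
+    "The strategies above fall into three groups. Some recall articles similar to those the user has already read; they differ only in how the similarity is computed, for example article collaborative filtering or article content embeddings. Others recommend by user similarity, giving a user the articles read by similar users, for example user collaborative filtering and user embeddings. A third approach resembles matrix factorization: once user and article embeddings are learned, the user-article similarity is computed directly and used for recommendation, as in YouTube DNN. Each recall method is covered in detail below:"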
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### YoutubeDNN recall\n",
+    "**(This step directly produces each user's list of recalled candidate articles)**\n",
+    "\n",
+    "[Paper download link](https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf)\n",
+    "\n",
+    "**YoutubeDNN recall architecture**\n",
+ "\n",
+ "![image-20201111160516562](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201111160516562.png)\n",
+ "\n",
+ "\n",
+ "\n",
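+    "For the principles and applications of YoutubeDNN, two blog posts by Wang Zhe are recommended:\n",
+    "\n",
+    "1. [Rereading the YouTube deep learning recommender system paper: every word a gem](https://zhuanlan.zhihu.com/p/52169807)\n",
+    "2. [Ten engineering questions about the YouTube deep learning recommender system](https://zhuanlan.zhihu.com/p/52504407)\n",
+    "\n",
+    "\n",
+    "**References:**\n",
+    "1. https://zhuanlan.zhihu.com/p/52169807 (YouTubeDNN principles)\n",
+    "2. https://zhuanlan.zhihu.com/p/26306795 (a highly upvoted Zhihu article on Word2Vec); word2vec itself is covered in the w2v introduction of the ranking part\n"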
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:13:11.058766Z",
+ "start_time": "2020-11-16T10:13:11.041084Z"
+ }
+ },
+ "outputs": [],
+ "source": [
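+    "# Build the training and validation data for two-tower recall\n",
+    "# negsample is the number of negative samples per positive when building samples with a sliding window\n",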
+ "def gen_data_set(data, negsample=0):\n",
+ " data.sort_values(\"click_timestamp\", inplace=True)\n",
+ " item_ids = data['click_article_id'].unique()\n",
+ "\n",
+ " train_set = []\n",
+ " test_set = []\n",
+ " for reviewerID, hist in tqdm(data.groupby('user_id')):\n",
+ " pos_list = hist['click_article_id'].tolist()\n",
+ " \n",
+ " if negsample > 0:\n",
+    "            candidate_set = list(set(item_ids) - set(pos_list))   # draw negatives from articles the user has not clicked\n",
+    "            neg_list = np.random.choice(candidate_set,size=len(pos_list)*negsample,replace=True)  # negsample negatives for each positive\n",
+    "            \n",
+    "        # A user with a single click still goes into the training set; otherwise some embeddings would never be learned\n",
+ " if len(pos_list) == 1:\n",
+ " train_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
+ " test_set.append((reviewerID, [pos_list[0]], pos_list[0],1,len(pos_list)))\n",
+ " \n",
+    "        # Build positive and negative samples with a sliding window\n",
+ " for i in range(1, len(pos_list)):\n",
+ " hist = pos_list[:i]\n",
+ " \n",
+ " if i != len(pos_list) - 1:\n",
+    "                train_set.append((reviewerID, hist[::-1], pos_list[i], 1, len(hist[::-1])))  # positive sample [user_id, his_item, pos_item, label, len(his_item)]\n",
+ " for negi in range(negsample):\n",
+    "                    train_set.append((reviewerID, hist[::-1], neg_list[i*negsample+negi], 0,len(hist[::-1])))  # negative sample [user_id, his_item, neg_item, label, len(his_item)]\n",
+ " else:\n",
+    "                # Use the longest history, whose target is the user's last click, as the test sample\n",
+ " test_set.append((reviewerID, hist[::-1], pos_list[i],1,len(hist[::-1])))\n",
+ " \n",
+ " random.shuffle(train_set)\n",
+ " random.shuffle(test_set)\n",
+ " \n",
+ " return train_set, test_set\n",
+ "\n",
+    "# Pad the inputs so that all sequence features have the same length\n",
+ "def gen_model_input(train_set,user_profile,seq_max_len):\n",
+ "\n",
+ " train_uid = np.array([line[0] for line in train_set])\n",
+ " train_seq = [line[1] for line in train_set]\n",
+ " train_iid = np.array([line[2] for line in train_set])\n",
+ " train_label = np.array([line[3] for line in train_set])\n",
+ " train_hist_len = np.array([line[4] for line in train_set])\n",
+ "\n",
+ " train_seq_pad = pad_sequences(train_seq, maxlen=seq_max_len, padding='post', truncating='post', value=0)\n",
+ " train_model_input = {\"user_id\": train_uid, \"click_article_id\": train_iid, \"hist_article_id\": train_seq_pad,\n",
+ " \"hist_len\": train_hist_len}\n",
+ "\n",
+ " return train_model_input, train_label"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:13:18.124452Z",
+ "start_time": "2020-11-16T10:13:18.098284Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def youtubednn_u2i_dict(data, topk=20): \n",
+ " sparse_features = [\"click_article_id\", \"user_id\"]\n",
+    "    SEQ_LEN = 30 # length of the user click sequence: shorter ones are padded, longer ones truncated\n",
+ " \n",
+ " user_profile_ = data[[\"user_id\"]].drop_duplicates('user_id')\n",
+ " item_profile_ = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
+ " \n",
+    "    # Label-encode the categorical features\n",
+ " features = [\"click_article_id\", \"user_id\"]\n",
+ " feature_max_idx = {}\n",
+ " \n",
+ " for feature in features:\n",
+ " lbe = LabelEncoder()\n",
+ " data[feature] = lbe.fit_transform(data[feature])\n",
+ " feature_max_idx[feature] = data[feature].max() + 1\n",
+ " \n",
+    "    # Extract the user and item profiles; which features to include still needs further analysis\n",
+ " user_profile = data[[\"user_id\"]].drop_duplicates('user_id')\n",
+ " item_profile = data[[\"click_article_id\"]].drop_duplicates('click_article_id') \n",
+ " \n",
+ " user_index_2_rawid = dict(zip(user_profile['user_id'], user_profile_['user_id']))\n",
+ " item_index_2_rawid = dict(zip(item_profile['click_article_id'], item_profile_['click_article_id']))\n",
+ " \n",
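+    "    # Split into training and test sets\n",
+    "    # Deep models usually need a lot of data, so a sliding window is used to expand the training samples and keep recall quality up\n",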
+ " train_set, test_set = gen_data_set(data, 0)\n",
+    "    # Assemble the model inputs; see the functions above for the details\n",
+ " train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)\n",
+ " test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)\n",
+ " \n",
+    "    # Set the embedding dimension\n",
+ " embedding_dim = 16\n",
+ " \n",
+    "    # Describe the features in the form the model accepts\n",
+ " user_feature_columns = [SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),\n",
+ " VarLenSparseFeat(SparseFeat('hist_article_id', feature_max_idx['click_article_id'], embedding_dim,\n",
+ " embedding_name=\"click_article_id\"), SEQ_LEN, 'mean', 'hist_len'),]\n",
+ " item_feature_columns = [SparseFeat('click_article_id', feature_max_idx['click_article_id'], embedding_dim)]\n",
+ " \n",
+    "    # Define the model\n",
+    "    # num_sampled: number of samples used in negative sampling\n",
+ " model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, embedding_dim))\n",
+    "    # Compile the model\n",
+ " model.compile(optimizer=\"adam\", loss=sampledsoftmaxloss) \n",
+ " \n",
+    "    # Train the model; validation_split sets the validation fraction, and 0 trains on all the data\n",
+ " history = model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1, validation_split=0.0)\n",
+ " \n",
+    "    # After training, extract the learned embeddings for both the user side and the item side\n",
+ " test_user_model_input = test_model_input\n",
+ " all_item_model_input = {\"click_article_id\": item_profile['click_article_id'].values}\n",
+ "\n",
+ " user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)\n",
+ " item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)\n",
+ " \n",
+    "    # Keep the item and user embeddings since ranking may use them later; make sure they are saved under the raw ids\n",
+ " user_embs = user_embedding_model.predict(test_user_model_input, batch_size=2 ** 12)\n",
+ " item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)\n",
+ " \n",
+    "    # L2-normalize the embeddings before saving\n",
+ " user_embs = user_embs / np.linalg.norm(user_embs, axis=1, keepdims=True)\n",
+ " item_embs = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)\n",
+ " \n",
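+    "    # Convert the embeddings to dictionaries for fast lookup\n",
+    "    # user_embs rows follow the shuffled test-set order, so pair them with the encoded ids from the test input\n",
+    "    raw_user_id_emb_dict = {user_index_2_rawid[k]: \\\n",
+    "                                v for k, v in zip(test_user_model_input['user_id'], user_embs)}\n",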
+ " raw_item_id_emb_dict = {item_index_2_rawid[k]: \\\n",
+ " v for k, v in zip(item_profile['click_article_id'], item_embs)}\n",
+    "    # Save the embeddings to disk\n",
+ " pickle.dump(raw_user_id_emb_dict, open(save_path + 'user_youtube_emb.pkl', 'wb'))\n",
+ " pickle.dump(raw_item_id_emb_dict, open(save_path + 'item_youtube_emb.pkl', 'wb'))\n",
+ " \n",
+    "    # faiss nearest-neighbor search: use each user embedding to find the topk most similar items\n",
+ " index = faiss.IndexFlatIP(embedding_dim)\n",
+    "    # The embeddings were already normalized above, so no further normalization is needed here\n",
+ "# faiss.normalize_L2(user_embs)\n",
+ "# faiss.normalize_L2(item_embs)\n",
+    "    index.add(item_embs) # build the index from the item vectors\n",
+    "    sim, idx = index.search(np.ascontiguousarray(user_embs), topk) # query the topk most similar items for each user\n",
+ " \n",
+ " user_recall_items_dict = collections.defaultdict(dict)\n",
+ " for target_idx, sim_value_list, rele_idx_list in tqdm(zip(test_user_model_input['user_id'], sim, idx)):\n",
+ " target_raw_id = user_index_2_rawid[target_idx]\n",
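+    "        # The query is a user and the index holds items, so there is no self-match to skip: keep all topk hits\n",
+    "        for rele_idx, sim_value in zip(rele_idx_list, sim_value_list): \n",
+    "            # rele_idx is a row position in item_embs; convert it to the encoded id before looking up the raw id\n",
+    "            rele_raw_id = item_index_2_rawid[all_item_model_input['click_article_id'][rele_idx]]\n",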
+ " user_recall_items_dict[target_raw_id][rele_raw_id] = user_recall_items_dict.get(target_raw_id, {})\\\n",
+ " .get(rele_raw_id, 0) + sim_value\n",
+ " \n",
+    "    # Sort each user's recalled items by score, highest first\n",
+    "    user_recall_items_dict = {k: sorted(v.items(), key=lambda x: x[1], reverse=True) for k, v in user_recall_items_dict.items()}\n",
+ " \n",
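+    "    # Save the recall results\n",
+    "    # The recall result comes straight from vector search here. The earlier methods only produce i2i and u2u similarity matrices, which still need a collaborative filtering pass to yield recall results\n",
+    "    # This result can be evaluated directly; for convenience, one shared evaluation function is used for all recall results\n",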
+ " pickle.dump(user_recall_items_dict, open(save_path + 'youtube_u2i_dict.pkl', 'wb'))\n",
+ " return user_recall_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T10:21:46.420014Z",
+ "start_time": "2020-11-16T10:13:35.351131Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [02:02<00:00, 2038.57it/s]\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: calling reduce_sum_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "keep_dims is deprecated, use keepdims instead\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:253: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Deprecated in favor of operator or tf.math.divide.\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "1149673/1149673 [==============================] - 216s 188us/sample - loss: 0.1326\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "250000it [00:32, 7720.75it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# For recall evaluation, the last click of every user is held out from the training set\n",
+ "if not metric_recall:\n",
+ " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(all_click_df, topk=20)\n",
+ "else:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ " user_multi_recall_dict['youtubednn_recall'] = youtubednn_u2i_dict(trn_hist_click_df, topk=20)\n",
+    "    # Evaluate the recall result\n",
+ " metrics_recall(user_multi_recall_dict['youtubednn_recall'], trn_last_click_df, topk=20)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### itemcf recall\n",
+ "\n",
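+    "The article similarity matrices were obtained above through collaborative filtering and embedding retrieval. Next, collaborative filtering is used to recall, for each user, articles similar to those in the user's click history.\n",
+    "Association rules are applied again during recall:\n",
+    "1. A weight for the position of the historical article in the user's click sequence (see the code for details)\n",
+    "2. A weight for article creation time, that is, the gap between the creation times of the similar article and the historically clicked article\n",
+    "3. A content-similarity weight computed from the article embeddings. Note that the embedding step did not compute similarity for every pair of articles, so a similar article may have no embedding similarity with a historically clicked article and needs special handling"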
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T11:48:40.580553Z",
+ "start_time": "2020-11-16T11:48:40.567130Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Item-based recall (i2i)\n",
+ "def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
+ " \"\"\"\n",
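+    "        Recall based on article collaborative filtering\n",
+    "        :param user_id: user id\n",
+    "        :param user_item_time_dict: dict, each user's clicked articles ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
+    "        :param i2i_sim: dict, article similarity matrix\n",
+    "        :param sim_item_topk: int, number of most similar articles to take for each clicked article\n",
+    "        :param recall_item_num: int, number of articles to recall in the end\n",
+    "        :param item_topk_click: list of the most-clicked articles, used to fill up the recall list\n",
+    "        :param item_created_time_dict: dict mapping article id to its creation time\n",
+    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
+    "        \n",
+    "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",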
+ " \"\"\"\n",
+    "    # Get the articles the user has interacted with\n",
+    "    user_hist_items = user_item_time_dict[user_id]\n",
+    "    user_hist_items_ = {item_id for item_id, _ in user_hist_items}\n",
+ " \n",
+ " item_rank = {}\n",
+ " for loc, (i, click_time) in enumerate(user_hist_items):\n",
+ " for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:\n",
+ " if j in user_hist_items_:\n",
+ " continue\n",
+ " \n",
+    "            # Weight for the gap between the two articles' creation times\n",
+ " created_time_weight = np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",
+    "            # Position weight: more recent clicks in the user's history count more\n",
+ " loc_weight = (0.9 ** (len(user_hist_items) - loc))\n",
+ " \n",
+ " content_weight = 1.0\n",
+ " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
+ " content_weight += emb_i2i_sim[i][j]\n",
+ " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
+ " content_weight += emb_i2i_sim[j][i]\n",
+ " \n",
+ " item_rank.setdefault(j, 0)\n",
+ " item_rank[j] += created_time_weight * loc_weight * content_weight * wij\n",
+ " \n",
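+    "    # Fewer than recall_item_num items: fill up with popular articles\n",
+    "    if len(item_rank) < recall_item_num:\n",
+    "        for i, item in enumerate(item_topk_click):\n",
+    "            if item in item_rank: # skip items that are already in the recall list\n",
+    "                continue\n",
+    "            item_rank[item] = - i - 100 # any negative score works, so fillers rank below real recalls\n",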
+ " if len(item_rank) == recall_item_num:\n",
+ " break\n",
+ " \n",
+ " item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]\n",
+ " \n",
+ " return item_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### itemcf sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T14:41:23.433038Z",
+ "start_time": "2020-11-16T11:48:46.286350Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [2:51:13<00:00, 24.33it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+    "# Run itemcf recall first; the last click is held out when recall evaluation is enabled\n",
+ "\n",
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "\n",
+ "i2i_sim = pickle.load(open(save_path + 'itemcf_i2i_sim.pkl', 'rb'))\n",
+ "emb_i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_item_topk = 20\n",
+ "recall_item_num = 10\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, \\\n",
+ " i2i_sim, sim_item_topk, recall_item_num, \\\n",
+ " item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ "\n",
+ "user_multi_recall_dict['itemcf_sim_itemcf_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['itemcf_sim_itemcf_recall'], open(save_path + 'itemcf_recall_dict.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate the recall result\n",
+ " metrics_recall(user_multi_recall_dict['itemcf_sim_itemcf_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### embedding sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T15:04:51.527795Z",
+ "start_time": "2020-11-16T14:59:03.907519Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [04:35<00:00, 905.85it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+    "# Hold out the last click when recall evaluation is enabled\n",
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
+ "\n",
+ "sim_item_topk = 20\n",
+ "recall_item_num = 10\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
+ " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ " \n",
+ "user_multi_recall_dict['embedding_sim_item_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['embedding_sim_item_recall'], open(save_path + 'embedding_sim_item_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+    "    # Evaluate the recall result\n",
+ " metrics_recall(user_multi_recall_dict['embedding_sim_item_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
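+    "### usercf recall\n",
+    "\n",
+    "User-based collaborative filtering recommends to a user the articles that similar users clicked. Because the similar users' click histories are involved, association rules can again weight the articles the user may click. The rules used here weight the relation between each article clicked by a similar user and the articles clicked by the target user. This relation borrows directly from the item-based collaborative filtering approach, except that the weights are accumulated over the target user's history for each candidate article. The weights used, with the code, follow:\n",
+    "\n",
+    "1. For each candidate article from a similar user, sum its similarity, creation-time gap, and relative position against the target user's clicked articles; each sum serves as its own weight"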
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:32.293990Z",
+ "start_time": "2020-11-17T02:09:32.278678Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# User-based recall (u2u2i)\n",
+ "def user_based_recommend(user_id, user_item_time_dict, u2u_sim, sim_user_topk, recall_item_num, \n",
+ " item_topk_click, item_created_time_dict, emb_i2i_sim):\n",
+ " \"\"\"\n",
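+    "        Recall based on user collaborative filtering\n",
+    "        :param user_id: user id\n",
+    "        :param user_item_time_dict: dict, each user's clicked articles ordered by click time   {user1: [(item1, time1), (item2, time2)..]...}\n",
+    "        :param u2u_sim: dict, user similarity matrix\n",
+    "        :param sim_user_topk: int, number of most similar users to take for the current user\n",
+    "        :param recall_item_num: int, number of articles to recall in the end\n",
+    "        :param item_topk_click: list of the most-clicked articles, used to fill up the recall list\n",
+    "        :param item_created_time_dict: dict mapping article id to its creation time\n",
+    "        :param emb_i2i_sim: dict, article similarity matrix computed from content embeddings\n",
+    "        \n",
+    "        return: list of recalled articles [(item1, score1), (item2, score2)...]\n",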
+ " \"\"\"\n",
+    "    # Interaction history\n",
+ " user_item_time_list = user_item_time_dict[user_id] # [(item1, time1), (item2, time2)..]\n",
+    "    user_hist_items = set([i for i, t in user_item_time_list])   # a user may click the same article several times, so deduplicate\n",
+ " \n",
+ " items_rank = {}\n",
+ " for sim_u, wuv in sorted(u2u_sim[user_id].items(), key=lambda x: x[1], reverse=True)[:sim_user_topk]:\n",
+ " for i, click_time in user_item_time_dict[sim_u]:\n",
+ " if i in user_hist_items:\n",
+ " continue\n",
+ " items_rank.setdefault(i, 0)\n",
+ " \n",
+ " loc_weight = 1.0\n",
+ " content_weight = 1.0\n",
+ " created_time_weight = 1.0\n",
+ " \n",
+    "            # Accumulate weights between the candidate article and each article in the user's history\n",
+ " for loc, (j, click_time) in enumerate(user_item_time_list):\n",
+    "                # Relative position weight of the click\n",
+ " loc_weight += 0.9 ** (len(user_item_time_list) - loc)\n",
+    "                # Content-similarity weight\n",
+ " if emb_i2i_sim.get(i, {}).get(j, None) is not None:\n",
+ " content_weight += emb_i2i_sim[i][j]\n",
+ " if emb_i2i_sim.get(j, {}).get(i, None) is not None:\n",
+ " content_weight += emb_i2i_sim[j][i]\n",
+ " \n",
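+    "                # Creation-time gap weight; uses 0.8 ** gap, as in item_based_recommend, so that a smaller gap gives a larger weight\n",
+    "                created_time_weight += np.exp(0.8 ** np.abs(item_created_time_dict[i] - item_created_time_dict[j]))\n",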
+ " \n",
+ " items_rank[i] += loc_weight * content_weight * created_time_weight * wuv\n",
+ " \n",
+    "    # Fill up with popular articles\n",
+    "    if len(items_rank) < recall_item_num:\n",
+    "        for i, item in enumerate(item_topk_click):\n",
+    "            if item in items_rank: # skip items that are already in the recall list\n",
+    "                continue\n",
+    "            items_rank[item] = - i - 100 # any negative score works, so fillers rank below real recalls\n",
+ " if len(items_rank) == recall_item_num:\n",
+ " break\n",
+ " \n",
+ " items_rank = sorted(items_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num] \n",
+ " \n",
+ " return items_rank"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### usercf sim recall"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:05:41.652501Z",
+ "start_time": "2020-11-16T07:05:40.953871Z"
+ }
+ },
+ "outputs": [],
+ "source": [
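+    "# Hold out the last click when recall evaluation is enabled\n",
+    "# Computing user-to-user similarity in usercf needs too much memory, so this was run on a sampled subset instead of the full data\n",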
+ "if metric_recall:\n",
+ " trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ " trn_hist_click_df = all_click_df\n",
+ " \n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "\n",
+ "u2u_sim = pickle.load(open(save_path + 'usercf_u2u_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_user_topk = 20\n",
+ "recall_item_num = 10\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ " user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
+ " recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim) \n",
+ "\n",
+ "pickle.dump(user_recall_items_dict, open(save_path + 'usercf_u2u2i_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+ " # 召回效果评估\n",
+ " metrics_recall(user_recall_items_dict, trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T03:09:35.853516Z",
+ "start_time": "2020-11-16T03:09:35.737625Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
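+ "#### user embedding sim recall\n",
+ "\n",
+ "The usercf user-user similarity was not computed directly, but the user-based collaborative filtering code above can still be validated. The user embeddings produced by YoutubeDNN are used in a vector search to find the topk most similar users for each user, and the resulting u2u similarity matrix is then passed to usercf for recall. The code is as follows"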
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:46.807811Z",
+ "start_time": "2020-11-17T02:09:46.798033Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Build the u2u similarity matrix from embeddings\n",
+ "# topk is the number of most similar users that faiss returns for each user\n",
+ "def u2u_embdding_sim(click_df, user_emb_dict, save_path, topk):\n",
+ "    \n",
+ "    user_list = []\n",
+ "    user_emb_list = []\n",
+ "    for user_id, user_emb in user_emb_dict.items():\n",
+ "        user_list.append(user_id)\n",
+ "        user_emb_list.append(user_emb)\n",
+ "        \n",
+ "    user_index_2_rawid_dict = {k: v for k, v in zip(range(len(user_list)), user_list)}\n",
+ "    \n",
+ "    user_emb_np = np.array(user_emb_list, dtype=np.float32)\n",
+ "    \n",
+ "    # Build the faiss index\n",
+ "    user_index = faiss.IndexFlatIP(user_emb_np.shape[1])\n",
+ "    user_index.add(user_emb_np)\n",
+ "    # Similarity search: for the vector at each index position, return the topk users and their similarities\n",
+ "    sim, idx = user_index.search(user_emb_np, topk)  # two arrays: similarities and index positions\n",
+ "    \n",
+ "    # Map the vector search results back to raw user ids\n",
+ "    user_sim_dict = collections.defaultdict(dict)\n",
+ "    for target_idx, sim_value_list, rele_idx_list in tqdm(zip(range(len(user_emb_np)), sim, idx)):\n",
+ "        target_raw_id = user_index_2_rawid_dict[target_idx]\n",
+ "        # Start from 1 to drop the user itself, so only topk-1 similar users are kept\n",
+ "        for rele_idx, sim_value in zip(rele_idx_list[1:], sim_value_list[1:]):\n",
+ "            rele_raw_id = user_index_2_rawid_dict[rele_idx]\n",
+ "            user_sim_dict[target_raw_id][rele_raw_id] = user_sim_dict.get(target_raw_id, {}).get(rele_raw_id, 0) + sim_value\n",
+ "    \n",
+ "    # Save the u2u similarity matrix\n",
+ "    pickle.dump(user_sim_dict, open(save_path + 'youtube_u2u_sim.pkl', 'wb'))\n",
+ "    return user_sim_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:14:31.355905Z",
+ "start_time": "2020-11-17T02:09:53.236531Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "250000it [00:23, 10507.45it/s]\n"
+ ]
+ }
+ ],
+ "source": [
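+ "# Load the user embeddings produced by YoutubeDNN, then use faiss to compute user-user similarity\n",
+ "# Note that these user embeddings are not very good: YoutubeDNN trains them on user click sequences,\n",
+ "# so when the sequences are generally short the result is poor\n",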
+ "user_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
+ "u2u_sim = u2u_embdding_sim(all_click_df, user_emb_dict, save_path, topk=10)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The user_embedding obtained from YoutubeDNN"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:49:40.755431Z",
+ "start_time": "2020-11-17T02:28:47.003514Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [19:43<00:00, 211.22it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Use the recall evaluation function to check how well this recall method works\n",
+ "if metric_recall:\n",
+ "    trn_hist_click_df, trn_last_click_df = get_hist_and_last_click(all_click_df)\n",
+ "else:\n",
+ "    trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "u2u_sim = pickle.load(open(save_path + 'youtube_u2u_sim.pkl', 'rb'))\n",
+ "\n",
+ "sim_user_topk = 20\n",
+ "recall_item_num = 10\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ "    user_recall_items_dict[user] = user_based_recommend(user, user_item_time_dict, u2u_sim, sim_user_topk, \\\n",
+ "                                                        recall_item_num, item_topk_click, item_created_time_dict, emb_i2i_sim)\n",
+ "    \n",
+ "user_multi_recall_dict['youtubednn_usercf_recall'] = user_recall_items_dict\n",
+ "pickle.dump(user_multi_recall_dict['youtubednn_usercf_recall'], open(save_path + 'youtubednn_usercf_recall.pkl', 'wb'))\n",
+ "\n",
+ "if metric_recall:\n",
+ "    # Evaluate recall quality\n",
+ "    metrics_recall(user_multi_recall_dict['youtubednn_usercf_recall'], trn_last_click_df, topk=recall_item_num)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:07:44.326253Z",
+ "start_time": "2020-11-16T07:07:43.798931Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Cold-start problem"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
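+ "**The cold-start problem falls into three categories: article cold start, user cold start, and system cold start.**\n",
+ "\n",
+ "- Article cold start: an article newly added to the platform has no interaction records, so the question is how to recommend it to users. (In our scenario, any article that never appears in the log data can be treated as a cold-start article.)\n",
+ "- User cold start: a user new to the platform has no article interactions yet, so the question is how to recommend to that user. (In our scenario, this means checking whether a test-set user appears in the log data that accompanies the test set; a user who does not can be treated as a cold-start user. The definition need not be this strict, though. We can also set our own criteria for identifying cold-start users, such as usage time, click-through rate, or retention rate.)\n",
+ "- System cold start: a platform that has just launched has no historical data at all. This is essentially a combination of the two cases above.\n",
+ "\n",
+ "**Cold-start analysis for the current scenario:**\n",
+ "\n",
+ "Analysis of the data shows that only a little over 30,000 distinct articles are ever clicked in the logs, while the full article corpus contains more than 300,000. Could a test-set user's last click land on an article that never appears in the logs? If so, the clicked article has no prior interaction data, which is exactly the article cold-start case. The analysis also shows that a sizable share of test-set users have only one click, and recommending articles with a model from a single click is difficult, so user cold start is worth considering too. Here we give solutions and code only for article (item) cold start; for user cold start we only outline some feasible approaches.\n",
+ "\n",
+ "1. Article cold start (without the cold-start exploration problem)  \n",
+ "    We are not doing article cold start for its own sake. Rather, we suspect that users may click articles that never appear in the log data, so the task is to choose cold-start candidates from the nearly 270,000 unseen articles. This can be viewed as one more recall strategy, and we use a simple, easy-to-understand rule-based approach to retrieve unseen articles that the user is likely to click.  \n",
+ "    The question then becomes: how do we pick a small subset of those 270,000 articles for each user? Random selection is one option. Some reference approaches are given below.\n",
+ "    1. First, use embeddings to recall a set of articles similar to the user's history\n",
+ "    2. Then filter the embedding-recalled articles with rules so that the remaining ones are more likely to be clicked. Possible rules: keep articles whose topic matches the user's previously clicked articles, or whose word count is similar. Also prefer articles created close to the test-set user's last click time, for example on the same day.\n",
+ "2. User cold start  \n",
+ "    Analysis of the test-set click data shows that 20% of test-set users have only one click. Could recall for these low-activity users be supplemented with dedicated strategies? Or could some articles be appended by rule after ranking? Both are worth trying; no concrete implementation is provided here.\n",
+ "    \n",
+ "**Note:**  \n",
+ "\n",
+ "This looks the same as running itemcf on embedding-based item similarities, but the goal is different: here we want articles that are similar in embedding space and have not yet appeared in the logs, combined with some additional cold-start rules. The number of articles recalled needs to be somewhat larger than usual, otherwise nothing may be left after filtering"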
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T04:30:23.027164Z",
+ "start_time": "2020-11-17T04:23:09.960235Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [05:01<00:00, 828.60it/s] \n"
+ ]
+ }
+ ],
+ "source": [
+ "# Run itemcf recall first. No recall evaluation is needed here; this is only a strategy\n",
+ "trn_hist_click_df = all_click_df\n",
+ "\n",
+ "user_recall_items_dict = collections.defaultdict(dict)\n",
+ "user_item_time_dict = get_user_item_time(trn_hist_click_df)\n",
+ "i2i_sim = pickle.load(open(save_path + 'emb_i2i_sim.pkl','rb'))\n",
+ "\n",
+ "sim_item_topk = 150\n",
+ "recall_item_num = 100  # recall somewhat more articles to leave room for the rule-based filtering\n",
+ "\n",
+ "item_topk_click = get_item_topk_click(trn_hist_click_df, k=50)\n",
+ "for user in tqdm(trn_hist_click_df['user_id'].unique()):\n",
+ "    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, sim_item_topk, \n",
+ "                                                        recall_item_num, item_topk_click,item_created_time_dict, emb_i2i_sim)\n",
+ "pickle.dump(user_recall_items_dict, open(save_path + 'cold_start_items_raw_dict.pkl', 'wb'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:11:39.267581Z",
+ "start_time": "2020-11-17T06:11:39.252563Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Rule-based article filtering\n",
+ "# Keep articles whose topic matches a topic in the user's browsing history\n",
+ "# Keep articles whose word count is close to that of the user's past articles\n",
+ "# Keep articles created close to the user's last clicked article (within 90 days in the code below)\n",
+ "# Return the final results ordered by similarity\n",
+ "\n",
+ "def get_click_article_ids_set(all_click_df):\n",
+ "    return set(all_click_df.click_article_id.values)\n",
+ "\n",
+ "def cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
+ "                     user_last_item_created_time_dict, item_type_dict, item_words_dict, \n",
+ "                     item_created_time_dict, click_article_ids_set, recall_item_num):\n",
+ "    \"\"\"\n",
+ "    Recall some articles in the cold-start setting\n",
+ "    :param user_recall_items_dict: articles recalled by content-embedding similarity, dict, {user1: [(item1, score1), (item2, score2), ..], }\n",
+ "    :param user_hist_item_typs_dict: dict, maps each user to the topics of the articles they clicked\n",
+ "    :param user_hist_item_words_dict: dict, maps each user to the word count of their clicked articles (used below as a mean)\n",
+ "    :param user_last_item_created_time_dict: dict, maps each user to the creation time of their last clicked article\n",
+ "    :param item_type_dict: dict, maps each article to its topic\n",
+ "    :param item_words_dict: dict, maps each article to its word count\n",
+ "    :param item_created_time_dict: dict, maps each article to its creation time\n",
+ "    :param click_article_ids_set: set of articles that users have clicked, i.e. articles that appear in the logs\n",
+ "    :param recall_item_num: number of articles to recall; these are articles that do not appear in the logs\n",
+ "    \"\"\"\n",
+ "    \n",
+ "    cold_start_user_items_dict = {}\n",
+ "    for user, item_list in tqdm(user_recall_items_dict.items()):\n",
+ "        cold_start_user_items_dict.setdefault(user, [])\n",
+ "        for item, score in item_list:\n",
+ "            # Look up the user's history information\n",
+ "            hist_item_type_set = user_hist_item_typs_dict[user]\n",
+ "            hist_mean_words = user_hist_item_words_dict[user]\n",
+ "            hist_last_item_created_time = user_last_item_created_time_dict[user]\n",
+ "            hist_last_item_created_time = datetime.fromtimestamp(hist_last_item_created_time)\n",
+ "            \n",
+ "            # Look up the information of the recalled article\n",
+ "            curr_item_type = item_type_dict[item]\n",
+ "            curr_item_words = item_words_dict[item]\n",
+ "            curr_item_created_time = item_created_time_dict[item]\n",
+ "            curr_item_created_time = datetime.fromtimestamp(curr_item_created_time)\n",
+ "\n",
+ "            # The article must not appear in the click logs; then filter by topic, word count, and creation time\n",
+ "            if curr_item_type not in hist_item_type_set or \\\n",
+ "                item in click_article_ids_set or \\\n",
+ "                abs(curr_item_words - hist_mean_words) > 200 or \\\n",
+ "                abs((curr_item_created_time - hist_last_item_created_time).days) > 90: \n",
+ "                continue\n",
+ "                \n",
+ "            cold_start_user_items_dict[user].append((item, score))  # {user1: [(item1, score1), (item2, score2)..]...}\n",
+ "    \n",
+ "    # Cap the number of cold-start articles recalled per user\n",
+ "    cold_start_user_items_dict = {k: sorted(v, key=lambda x:x[1], reverse=True)[:recall_item_num] \\\n",
+ "                                  for k, v in cold_start_user_items_dict.items()}\n",
+ "    \n",
+ "    pickle.dump(cold_start_user_items_dict, open(save_path + 'cold_start_user_items_dict.pkl', 'wb'))\n",
+ "    \n",
+ "    return cold_start_user_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:35:38.758278Z",
+ "start_time": "2020-11-17T06:31:40.164332Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [01:49<00:00, 2293.37it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+ "all_click_df_ = all_click_df.copy()\n",
+ "all_click_df_ = all_click_df_.merge(item_info_df, how='left', on='click_article_id')\n",
+ "user_hist_item_typs_dict, user_hist_item_ids_dict, user_hist_item_words_dict, user_last_item_created_time_dict = get_user_hist_item_info_dict(all_click_df_)\n",
+ "click_article_ids_set = get_click_article_ids_set(all_click_df)\n",
+ "# Note:\n",
+ "# Many rules are applied to filter the cold-start articles, so the earlier recall stage should retrieve as many articles as possible; otherwise they can easily all be filtered out\n",
+ "cold_start_user_items_dict = cold_start_items(user_recall_items_dict, user_hist_item_typs_dict, user_hist_item_words_dict, \\\n",
+ "                                              user_last_item_created_time_dict, item_type_dict, item_words_dict, \\\n",
+ "                                              item_created_time_dict, click_article_ids_set, recall_item_num)\n",
+ "\n",
+ "user_multi_recall_dict['cold_start_recall'] = cold_start_user_items_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-16T07:13:33.099298Z",
+ "start_time": "2020-11-16T07:13:32.655036Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
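+ "## Merging multi-channel recall\n",
+ "Merging multi-channel recall means combining the per-user article lists produced by all the recall strategies above. The recall results so far are:\n",
+ "1. Recall based on the item-item similarity (sim) computed by itemcf  \n",
+ "2. Recall based on the item-item similarity obtained from embedding search\n",
+ "3. YoutubeDNN recall\n",
+ "4. Recall based on the user-user similarity obtained from YoutubeDNN\n",
+ "5. Recall based on the cold-start strategy\n",
+ "\n",
+ "**Note:**  \n",
+ "Recall evaluation shows that some channels perform well and others poorly, so we can manually assign a weight to each channel's results for the final similarity fusion"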
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:02:16.033971Z",
+ "start_time": "2020-11-17T07:02:16.019819Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def combine_recall_results(user_multi_recall_dict, weight_dict=None, topk=25):\n",
+ "    final_recall_items_dict = {}\n",
+ "    \n",
+ "    # Normalize each recall channel's scores per user, so that scores for the same user and item can be summed across channels\n",
+ "    def norm_user_recall_items_sim(sorted_item_list):\n",
+ "        # If cold start yields no article or only one, return the list unchanged. This can happen when too few articles\n",
+ "        # were recalled and rule-based filtering removed them all; other filtering strategies could also be applied here\n",
+ "        if len(sorted_item_list) < 2:\n",
+ "            return sorted_item_list\n",
+ "        \n",
+ "        min_sim = sorted_item_list[-1][1]\n",
+ "        max_sim = sorted_item_list[0][1]\n",
+ "        \n",
+ "        norm_sorted_item_list = []\n",
+ "        for item, score in sorted_item_list:\n",
+ "            if max_sim > 0:\n",
+ "                norm_score = 1.0 * (score - min_sim) / (max_sim - min_sim) if max_sim > min_sim else 1.0\n",
+ "            else:\n",
+ "                norm_score = 0.0\n",
+ "            norm_sorted_item_list.append((item, norm_score))\n",
+ "        \n",
+ "        return norm_sorted_item_list\n",
+ "    \n",
+ "    print('多路召回合并...')  # \"Merging multi-channel recall...\"\n",
+ "    for method, user_recall_items in tqdm(user_multi_recall_dict.items()):\n",
+ "        print(method + '...')\n",
+ "        # A weight can be set for each recall channel when computing the final result\n",
+ "        if weight_dict is None:\n",
+ "            recall_method_weight = 1\n",
+ "        else:\n",
+ "            recall_method_weight = weight_dict[method]\n",
+ "        \n",
+ "        for user_id, sorted_item_list in user_recall_items.items():  # normalize\n",
+ "            user_recall_items[user_id] = norm_user_recall_items_sim(sorted_item_list)\n",
+ "        \n",
+ "        for user_id, sorted_item_list in user_recall_items.items():\n",
+ "            # print('user_id')\n",
+ "            final_recall_items_dict.setdefault(user_id, {})\n",
+ "            for item, score in sorted_item_list:\n",
+ "                final_recall_items_dict[user_id].setdefault(item, 0)\n",
+ "                final_recall_items_dict[user_id][item] += recall_method_weight * score\n",
+ "    \n",
+ "    final_recall_items_dict_rank = {}\n",
+ "    # The final number of recalled items can also be capped when merging\n",
+ "    for user, recall_item_dict in final_recall_items_dict.items():\n",
+ "        final_recall_items_dict_rank[user] = sorted(recall_item_dict.items(), key=lambda x: x[1], reverse=True)[:topk]\n",
+ "\n",
+ "    # Save the merged multi-channel recall result to disk\n",
+ "    pickle.dump(final_recall_items_dict_rank, open(os.path.join(save_path, 'final_recall_items_dict.pkl'),'wb'))\n",
+ "\n",
+ "    return final_recall_items_dict_rank"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:02:21.078455Z",
+ "start_time": "2020-11-17T07:02:21.074060Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# All recall channels are given the same weight here; the values can be tuned according to how each channel performed in the evaluation above\n",
+ "weight_dict = {'itemcf_sim_itemcf_recall': 1.0,\n",
+ "               'embedding_sim_item_recall': 1.0,\n",
+ "               'youtubednn_recall': 1.0,\n",
+ "               'youtubednn_usercf_recall': 1.0, \n",
+ "               'cold_start_recall': 1.0}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:04:35.747924Z",
+ "start_time": "2020-11-17T07:02:26.889573Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "  0%|          | 0/5 [00:00<?, ?it/s]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "多路召回合并...\n",
+ "itemcf_sim_itemcf_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 20%|██ | 1/5 [00:08<00:34, 8.66s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "embedding_sim_item_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 40%|████ | 2/5 [00:16<00:24, 8.29s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "youtubednn_recall...\n",
+ "youtubednn_usercf_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ " 80%|████████ | 4/5 [00:23<00:06, 6.98s/it]"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "cold_start_recall...\n"
+ ]
+ },
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 5/5 [00:42<00:00, 8.40s/it]\n"
+ ]
+ }
+ ],
+ "source": [
+ "# After merging, 150 articles are recalled per user for the ranking stage\n",
+ "final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, weight_dict, topk=150)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
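+ "## Summary\n",
+ "\n",
+ "The recall strategies implemented above are:\n",
+ "\n",
+ "1. itemcf based on association rules\n",
+ "2. usercf based on association rules\n",
+ "3. youtubednn recall\n",
+ "4. Cold-start recall\n",
+ "\n",
+ "None of the recall strategies implemented above is tuned to its best. They are only simple first attempts, and there is plenty of room for optimization: tuning the parameters of the existing strategies, adding new ones, or modifying the association rules. Other recall strategies are also worth trying, such as popularity-based recall for news.\n",
+ "\n",
+ "\n",
+ "\n",
+ "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, and the courage to experiment and take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; for details, follow Datawhale:\n",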
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- ],
- "description": "",
- "notebookId": "130009",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "595px",
- "left": "61px",
- "top": "67px",
- "width": "174px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.5"
+ },
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "nbTranslate": {
+ "displayLangs": [
+ "*"
+ ],
+ "hotkey": "alt-t",
+ "langInMainMenu": true,
+ "sourceLang": "en",
+ "targetLang": "fr",
+ "useGoogleTranslate": true
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [
+ {
+ "id": "83580",
+ "title": "零基础入门推荐系统 - 新闻推荐"
+ }
+ ],
+ "description": "",
+ "notebookId": "130009",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "595px",
+ "left": "61px",
+ "top": "67px",
+ "width": "174px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.ipynb" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
index f4e21cabc..d74eed156 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.ipynb"
@@ -1,1772 +1,1772 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 制作特征和标签, 转成监督学习问题\n",
- "我们先捋一下基于原始的给定数据, 有哪些特征可以直接利用:\n",
- "1. 文章的自身特征, category_id表示这文章的类型, created_at_ts表示文章建立的时间, 这个关系着文章的时效性, words_count是文章的字数, 一般字数太长我们不太喜欢点击, 也不排除有人就喜欢读长文。\n",
- "2. 文章的内容embedding特征, 这个召回的时候用过, 这里可以选择使用, 也可以选择不用, 也可以尝试其他类型的embedding特征, 比如W2V等\n",
- "3. 用户的设备特征信息\n",
- "\n",
- "上面这些直接可以用的特征, 待做完特征工程之后, 直接就可以根据article_id或者是user_id把这些特征加入进去。 但是我们需要先基于召回的结果, 构造一些特征,然后制作标签,形成一个监督学习的数据集。 \n",
- "构造监督数据集的思路, 根据召回结果, 我们会得到一个{user_id: [可能点击的文章列表]}形式的字典。 那么我们就可以对于每个用户, 每篇可能点击的文章构造一个监督测试集, 比如对于用户user1, 假设得到的他的召回列表{user1: [item1, item2, item3]}, 我们就可以得到三行数据(user1, item1), (user1, item2), (user1, item3)的形式, 这就是监督测试集时候的前两列特征。 \n",
- "\n",
- "构造特征的思路是这样, 我们知道每个用户的点击文章是与其历史点击的文章信息是有很大关联的, 比如同一个主题, 相似等等。 所以特征构造这块很重要的一系列特征**是要结合用户的历史点击文章信息**。我们已经得到了每个用户及点击候选文章的两列的一个数据集, 而我们的目的是要预测最后一次点击的文章, 比较自然的一个思路就是和其最后几次点击的文章产生关系, 这样既考虑了其历史点击文章信息, 又得离最后一次点击较近,因为新闻很大的一个特点就是注重时效性。 往往用户的最后一次点击会和其最后几次点击有很大的关联。 所以我们就可以对于每个候选文章, 做出与最后几次点击相关的特征如下:\n",
- "1. 候选item与最后几次点击的相似性特征(embedding内积) --- 这个直接关联用户历史行为\n",
- "2. 候选item与最后几次点击的相似性特征的统计特征 --- 统计特征可以减少一些波动和异常\n",
- "3. 候选item与最后几次点击文章的字数差的特征 --- 可以通过字数看用户偏好\n",
- "4. 候选item与最后几次点击的文章建立的时间差特征 --- 时间差特征可以看出该用户对于文章的实时性的偏好 \n",
- "\n",
- "\n",
- "还需要考虑一下\n",
- "**5. 如果使用了youtube召回的话, 我们还可以制作用户与候选item的相似特征**\n",
- "\n",
- "\n",
- "\n",
- "当然, 上面只是提供了一种基于用户历史行为做特征工程的思路, 大家也可以思维风暴一下,尝试一些其他的特征。 下面我们就实现上面的这些特征的制作, 下面的逻辑是这样:\n",
- "1. 我们首先获得用户的最后一次点击操作和用户的历史点击, 这个基于我们的日志数据集做\n",
- "2. 基于用户的历史行为制作特征, 这个会用到用户的历史点击表, 最后的召回列表, 文章的信息表和embedding向量\n",
- "3. 制作标签, 形成最后的监督学习数据集"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 导包"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:00.341709Z",
- "start_time": "2020-11-17T09:06:58.723900Z"
- },
- "cell_style": "center",
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import pandas as pd\n",
- "import pickle\n",
- "from tqdm import tqdm\n",
- "import gc, os\n",
- "import logging\n",
- "import time\n",
- "import lightgbm as lgb\n",
- "from gensim.models import Word2Vec\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# df节省内存函数"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:02.411005Z",
- "start_time": "2020-11-17T09:07:02.397830Z"
- }
- },
- "outputs": [],
- "source": [
- "# 节省内存的一个函数\n",
- "# 减少内存\n",
- "def reduce_mem(df):\n",
- " starttime = time.time()\n",
- " numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
- " start_mem = df.memory_usage().sum() / 1024**2\n",
- " for col in df.columns:\n",
- " col_type = df[col].dtypes\n",
- " if col_type in numerics:\n",
- " c_min = df[col].min()\n",
- " c_max = df[col].max()\n",
- " if pd.isnull(c_min) or pd.isnull(c_max):\n",
- " continue\n",
- " if str(col_type)[:3] == 'int':\n",
- " if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
- " df[col] = df[col].astype(np.int8)\n",
- " elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
- " df[col] = df[col].astype(np.int16)\n",
- " elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
- " df[col] = df[col].astype(np.int32)\n",
- " elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
- " df[col] = df[col].astype(np.int64)\n",
- " else:\n",
- " if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
- " df[col] = df[col].astype(np.float16)\n",
- " elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
- " df[col] = df[col].astype(np.float32)\n",
- " else:\n",
- " df[col] = df[col].astype(np.float64)\n",
- " end_mem = df.memory_usage().sum() / 1024**2\n",
- " print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
- " 100*(start_mem-end_mem)/start_mem,\n",
- " (time.time()-starttime)/60))\n",
- " return df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:05.031436Z",
- "start_time": "2020-11-17T09:07:05.026822Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 数据读取\n",
- "\n",
- "## 训练和验证集的划分\n",
- "\n",
- "划分训练和验证集的原因是为了在线下验证模型参数的好坏,为了完全模拟测试集,我们这里就在训练集中抽取部分用户的所有信息来作为验证集。提前做训练验证集划分的好处就是可以分解制作排序特征时的压力,一次性做整个数据集的排序特征可能时间会比较长。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:07.230308Z",
- "start_time": "2020-11-17T09:07:07.221081Z"
- }
- },
- "outputs": [],
- "source": [
- "# all_click_df指的是训练集\n",
- "# sample_user_nums 采样作为验证集的用户数量\n",
- "def trn_val_split(all_click_df, sample_user_nums):\n",
- " all_click = all_click_df\n",
- " all_user_ids = all_click.user_id.unique()\n",
- " \n",
- " # replace=True表示可以重复抽样,反之不可以\n",
- " sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False) \n",
- " \n",
- " click_val = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
- " click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]\n",
- " \n",
- " # 将验证集中的最后一次点击给抽取出来作为答案\n",
- " click_val = click_val.sort_values(['user_id', 'click_timestamp'])\n",
- " val_ans = click_val.groupby('user_id').tail(1)\n",
- " \n",
- " click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)\n",
- " \n",
- " # 去除val_ans中某些用户只有一个点击数据的情况,如果该用户只有一个点击数据,又被分到ans中,\n",
- " # 那么训练集中就没有这个用户的点击数据,出现用户冷启动问题,给自己模型验证带来麻烦\n",
- " val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())] # 保证答案中出现的用户再验证集中还有\n",
- " click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]\n",
- " \n",
- " return click_trn, click_val, val_ans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 获取历史点击和最后一次点击"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:19.202550Z",
- "start_time": "2020-11-17T09:07:19.195766Z"
- }
- },
- "outputs": [],
- "source": [
- "# 获取当前数据的历史点击和最后一次点击\n",
- "def get_hist_and_last_click(all_click):\n",
- " all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
- " click_last_df = all_click.groupby('user_id').tail(1)\n",
- "\n",
- " # 如果用户只有一个点击,hist为空了,会导致训练的时候这个用户不可见,此时默认泄露一下\n",
- " def hist_func(user_df):\n",
- " if len(user_df) == 1:\n",
- " return user_df\n",
- " else:\n",
- " return user_df[:-1]\n",
- "\n",
- " click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
- "\n",
- " return click_hist_df, click_last_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取训练、验证及测试集"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:21.181211Z",
- "start_time": "2020-11-17T09:07:21.171338Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_trn_val_tst_data(data_path, offline=True):\n",
- " if offline:\n",
- " click_trn_data = pd.read_csv(data_path+'train_click_log.csv') # 训练集用户点击日志\n",
- " click_trn_data = reduce_mem(click_trn_data)\n",
- " click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)\n",
- " else:\n",
- " click_trn = pd.read_csv(data_path+'train_click_log.csv')\n",
- " click_trn = reduce_mem(click_trn)\n",
- " click_val = None\n",
- " val_ans = None\n",
- " \n",
- " click_tst = pd.read_csv(data_path+'testA_click_log.csv')\n",
- " \n",
- " return click_trn, click_val, click_tst, val_ans"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取召回列表"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:23.210604Z",
- "start_time": "2020-11-17T09:07:23.203652Z"
- }
- },
- "outputs": [],
- "source": [
- "# 返回多路召回列表或者单路召回\n",
- "def get_recall_list(save_path, single_recall_model=None, multi_recall=False):\n",
- " if multi_recall:\n",
- " return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))\n",
- " \n",
- " if single_recall_model == 'i2i_itemcf':\n",
- " return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'i2i_emb_itemcf':\n",
- " return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'user_cf':\n",
- " return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))\n",
- " elif single_recall_model == 'youtubednn':\n",
- " return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取各种Embedding"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
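-    "##### Training Word2Vec with gensim\n",
-    "\n",
-    "The main idea behind Word2Vec is that the context of a word is a good indicator of its meaning, so word vectors can be learned without supervision. Word2Vec has two classic models: skip-gram and CBOW.\n",
-    "\n",
-    "- skip-gram: predict the surrounding words from the center word.\n",
-    "- cbow: predict the center word from the surrounding words.\n",
-    "\n",
-    "![image-20201106225233086](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)\n",
-    "\n",
-    "Several parameters matter when training word2vec with gensim (the names below are those of gensim 3.x; gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`):\n",
-    "- size: dimensionality of the word vectors.\n",
-    "- window: how far from the target word a context word can be and still count as context.\n",
-    "- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.\n",
-    "- workers: number of training threads.\n",
-    "- min_count: minimum frequency; words that occur fewer times than this are ignored.\n",
-    "- iter: number of passes over the whole corpus during training.\n",
-    "\n",
-    "**Notes**\n",
-    "1. The training corpus must be a two-dimensional list of string tokens, e.g. [['北', '京', '你', '好'], ['上', '海', '你', '好']]\n",
-    "2. The model has a number of default settings; in Jupyter you can inspect them with `Word2Vec??`\n",
-    "\n",
-    "\n",
-    "Here is a simple test example:\n",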
- "```\n",
-    "from gensim.models import Word2Vec\n",
-    "docs = [['30760', '157507'],\n",
-    "        ['289197', '63746'],\n",
-    "        ['36162', '168401'],\n",
-    "        ['50644', '36162']]\n",
-    "w2v = Word2Vec(docs, size=12, sg=1, window=2, seed=2020, workers=2, min_count=1, iter=1)\n",
-    "\n",
-    "# Look up the vector for the token '30760'\n",
-    "w2v.wv['30760']\n",
- "```\n",
- "\n",
-    "For the details of skip-gram and CBOW, see the following blog posts:\n",
- "- [word2vec原理(一) CBOW与Skip-Gram模型基础](https://www.cnblogs.com/pinard/p/7160330.html) \n",
-    "- [word2vec原理(二) 基于Hierarchical Softmax的模型](https://www.cnblogs.com/pinard/p/7243513.html) \n",
- "- [word2vec原理(三) 基于Negative Sampling的模型](https://www.cnblogs.com/pinard/p/7249903.html) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:26.676173Z",
- "start_time": "2020-11-17T09:07:26.667926Z"
- }
- },
- "outputs": [],
- "source": [
-    "def trian_item_word2vec(click_df, embed_size=16, save_name='item_w2v_emb.pkl', split_char=' '):\n",
-    "    click_df = click_df.sort_values('click_timestamp')\n",
-    "    # Word2Vec needs string tokens, so cast the article ids to str\n",
-    "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
-    "    # Turn each user's click sequence into one \"sentence\"\n",
-    "    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
-    "    docs = docs['click_article_id'].values.tolist()\n",
-    "\n",
-    "    # Configure logging so that training progress is visible\n",
-    "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
-    "\n",
-    "    # These parameters strongly affect the learned vectors; negative sampling defaults to 5\n",
-    "    w2v = Word2Vec(docs, size=embed_size, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1)\n",
-    "    \n",
-    "    # Save as a dict of article id -> vector (save_path is the notebook-level output directory)\n",
-    "    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id']}\n",
-    "    pickle.dump(item_w2v_emb_dict, open(save_path + save_name, 'wb'))\n",
-    "    \n",
-    "    return item_w2v_emb_dict"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:27.285690Z",
- "start_time": "2020-11-17T09:07:27.276646Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Each returned dict maps an id to its embedding (None if the file is missing)\n",
-    "def get_embedding(save_path, all_click_df):\n",
-    "    if os.path.exists(save_path + 'item_content_emb.pkl'):\n",
-    "        item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))\n",
-    "    else:\n",
-    "        item_content_emb_dict = None\n",
-    "        print('item_content_emb.pkl does not exist...')\n",
-    "        \n",
-    "    # The w2v embedding is trained on the fly if it has not been saved yet\n",
-    "    if os.path.exists(save_path + 'item_w2v_emb.pkl'):\n",
-    "        item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))\n",
-    "    else:\n",
-    "        item_w2v_emb_dict = trian_item_word2vec(all_click_df)\n",
-    "        \n",
-    "    if os.path.exists(save_path + 'item_youtube_emb.pkl'):\n",
-    "        item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))\n",
-    "    else:\n",
-    "        item_youtube_emb_dict = None\n",
-    "        print('item_youtube_emb.pkl does not exist...')\n",
-    "    \n",
-    "    if os.path.exists(save_path + 'user_youtube_emb.pkl'):\n",
-    "        user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
-    "    else:\n",
-    "        user_youtube_emb_dict = None\n",
-    "        print('user_youtube_emb.pkl does not exist...')\n",
-    "    \n",
-    "    return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Load article information"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:28.391797Z",
- "start_time": "2020-11-17T09:07:28.386650Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_article_info_df():\n",
- " article_info_df = pd.read_csv(data_path + 'articles.csv')\n",
- " article_info_df = reduce_mem(article_info_df)\n",
- " \n",
- " return article_info_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Load the data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:07:32.362045Z",
- "start_time": "2020-11-17T09:07:29.490413Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 23.34 Mb (69.4% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
-    "# The only difference between offline and online mode is whether the validation set is empty\n",
- "click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:10.378966Z",
- "start_time": "2020-11-17T09:07:32.468580Z"
- }
- },
- "outputs": [],
- "source": [
- "click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)\n",
- "\n",
- "if click_val is not None:\n",
- " click_val_hist, click_val_last = click_val, val_ans\n",
- "else:\n",
- " click_val_hist, click_val_last = None, None\n",
- " \n",
- "click_tst_hist = click_tst"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
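-    "## Negative sampling on the training data\n",
-    "\n",
-    "After recall, the data takes the form of (user1, item1, label) triples, and the positive and negative samples turn out to be extremely imbalanced. We can downsample the negatives first: this eases the class imbalance and also reduces the cost of building ranking features. What should we watch out for when doing negative sampling?\n",
-    "\n",
-    "1. Downsample only the negative samples (if you have a good way to augment the positives, that is worth considering as well).\n",
-    "2. After sampling, every user and every article should still appear in the sampled data.\n",
-    "3. The downsampling ratio can be chosen by hand to suit the situation.\n",
-    "4. After sampling, update each user's recalled-article list, because later features may use relative position information.\n",
-    "\n",
-    "Negative sampling could also be done after feature construction; it is moved up front here because building the ranking features is very slow."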
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:36.096678Z",
- "start_time": "2020-11-17T09:11:36.090911Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Convert the recall dict into a DataFrame\n",
- "def recall_dict_2_df(recall_list_dict):\n",
- " df_row_list = [] # [user, item, score]\n",
- " for user, recall_list in tqdm(recall_list_dict.items()):\n",
- " for item, score in recall_list:\n",
- " df_row_list.append([user, item, score])\n",
- " \n",
- " col_names = ['user_id', 'sim_item', 'score']\n",
- " recall_list_df = pd.DataFrame(df_row_list, columns=col_names)\n",
- " \n",
- " return recall_list_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:37.668844Z",
- "start_time": "2020-11-17T09:11:37.659774Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Negative sampling; sample_rate controls the sampling ratio and a default is provided\n",
-    "def neg_sample_recall_data(recall_items_df, sample_rate=0.001):\n",
-    "    pos_data = recall_items_df[recall_items_df['label'] == 1]\n",
-    "    neg_data = recall_items_df[recall_items_df['label'] == 0]\n",
-    "    \n",
-    "    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))\n",
-    "    \n",
-    "    # Per-group sampling function\n",
-    "    def neg_sample_func(group_df):\n",
-    "        neg_num = len(group_df)\n",
-    "        sample_num = max(int(neg_num * sample_rate), 1) # keep at least one sample\n",
-    "        sample_num = min(sample_num, 5) # keep at most 5 samples; tune this to your data\n",
-    "        return group_df.sample(n=sample_num, replace=True)\n",
-    "    \n",
-    "    # Sample negatives per user so that every user with negatives stays in the sampled data\n",
-    "    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)\n",
-    "    # Sample negatives per article so that every article with negatives stays in the sampled data\n",
-    "    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)\n",
-    "    \n",
-    "    # Combine the two sampled sets\n",
-    "    neg_data_new = pd.concat([neg_data_user_sample, neg_data_item_sample])\n",
-    "    # The two passes are independent and may pick the same row twice, so de-duplicate after combining\n",
-    "    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')\n",
-    "    \n",
-    "    # Add the positive samples back\n",
-    "    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)\n",
-    "    \n",
-    "    return data_new"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:39.481715Z",
- "start_time": "2020-11-17T09:11:39.475144Z"
- }
- },
- "outputs": [],
- "source": [
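-    "# Label the recalled data\n",
-    "def get_rank_label_df(recall_list_df, label_df, is_test=False):\n",
-    "    # The test set has no labels; use -1 as a placeholder so the downstream code stays uniform\n",
-    "    if is_test:\n",
-    "        recall_list_df = recall_list_df.copy()  # avoid writing into a slice of the caller's DataFrame\n",
-    "        recall_list_df['label'] = -1\n",
-    "        return recall_list_df\n",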
- " \n",
- " label_df = label_df.rename(columns={'click_article_id': 'sim_item'})\n",
- " recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \\\n",
- " how='left', on=['user_id', 'sim_item'])\n",
- " recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)\n",
- " del recall_list_df_['click_timestamp']\n",
- " \n",
- " return recall_list_df_"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T09:11:41.555566Z",
- "start_time": "2020-11-17T09:11:41.546766Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,click_trn_last, click_val_last, recall_list_df):\n",
-    "    # Recall lists for the training users\n",
-    "    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]\n",
-    "    # Label the training data\n",
-    "    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)\n",
-    "    # Negative sampling on the training data\n",
-    "    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)\n",
-    "    \n",
-    "    if click_val_hist is not None:\n",
-    "        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]\n",
-    "        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)\n",
-    "        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)\n",
-    "    else:\n",
-    "        val_user_item_label_df = None\n",
-    "        \n",
-    "    # The test data needs no negative sampling; every recalled article simply gets the label -1\n",
- " tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]\n",
- " tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)\n",
- " \n",
- " return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 56,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:23:35.357045Z",
- "start_time": "2020-11-17T17:23:12.378284Z"
- }
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 250000/250000 [00:12<00:00, 20689.39it/s]\n"
- ]
- }
- ],
- "source": [
-    "# Load the recall lists\n",
-    "recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf') # only a single recall channel is used here; the multi-channel result works too\n",
-    "# Convert the recall data to a DataFrame\n",
-    "recall_list_df = recall_dict_2_df(recall_list_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 57,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:29:04.598214Z",
- "start_time": "2020-11-17T17:23:40.001052Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "pos_data_num: 64190 neg_data_num: 1935810 pos/neg: 0.03315924600038227\n"
- ]
- }
- ],
- "source": [
-    "# Label the training/validation data and run negative sampling (this step takes a while)\n",
- "trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist, \n",
- " click_val_hist, \n",
- " click_tst_hist,\n",
- " click_trn_last, \n",
- " click_val_last, \n",
- " recall_list_df)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:23:11.642944Z",
- "start_time": "2020-11-17T17:23:08.475Z"
- },
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "trn_user_item_label_df.label"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## Convert the recall data into dicts"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 58,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:36:22.800449Z",
- "start_time": "2020-11-17T17:36:22.794670Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Convert the final recall DataFrame into a dict, which is used to build ranking features\n",
- "def make_tuple_func(group_df):\n",
- " row_data = []\n",
- " for name, row_df in group_df.iterrows():\n",
- " row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))\n",
- " \n",
- " return row_data"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 59,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T17:40:05.991819Z",
- "start_time": "2020-11-17T17:36:26.536429Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- "trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))\n",
- "\n",
- "if val_user_item_label_df is not None:\n",
- " val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- " val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))\n",
- "else:\n",
- " val_user_item_label_tuples_dict = None\n",
- " \n",
- "tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
- "tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T07:59:53.141560Z",
- "start_time": "2020-11-17T07:59:53.133599Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "# Feature engineering"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
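-    "## Features based on the user's click history\n",
-    "Build features for every article recalled for every user. The steps are:\n",
-    "* For each user, take the item_id of the last N clicked articles.\n",
-    "    * For each article recalled for that user, compute its similarity to each of those last N clicked articles together with the sum, max, min and mean of the similarities, plus time-difference features, word-count-difference features, and the similarity between the article and the user."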
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 60,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T01:07:47.268035Z",
- "start_time": "2020-11-18T01:07:47.250449Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Build history-based features from the recall data\n",
-    "def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1):\n",
-    "    \"\"\"\n",
-    "    Build features from the user's click history\n",
-    "    :param users_id: user ids\n",
-    "    :param recall_list: the candidate articles recalled for each user\n",
-    "    :param click_hist_df: the users' historical clicks\n",
-    "    :param articles_info: article information\n",
-    "    :param articles_emb: article embeddings; item_content_emb, item_w2v_emb or item_youtube_emb can be used\n",
-    "    :param user_emb: user embeddings, i.e. user_youtube_emb; optional, but if it is used then articles_emb must be item_youtube_emb so that the dimensions match\n",
-    "    :param N: number of most recent clicks; many users in the testA log have only one historical click, so the default is 1 to avoid empty values\n",
-    "    \"\"\"\n",
-    "    \n",
-    "    # Collect the rows in a 2-D list that is converted to a DataFrame at the end\n",
-    "    all_user_feas = []\n",
-    "    for user_id in tqdm(users_id):\n",
-    "        # The user's last N clicks (relies on the row order of click_hist_df)\n",
-    "        hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]\n",
-    "        \n",
-    "        # Iterate over the user's recall list\n",
-    "        for rank, (article_id, score, label) in enumerate(recall_list[user_id]):\n",
-    "            # Creation time and word count of the recalled article\n",
-    "            a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]\n",
-    "            a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]\n",
-    "            single_user_fea = [user_id, article_id]\n",
-    "            # Similarity, time difference and word-count difference to each of the last clicked articles\n",
-    "            sim_fea = []\n",
-    "            time_fea = []\n",
-    "            word_fea = []\n",
-    "            # Iterate over the user's last N clicked articles\n",
-    "            for hist_item in hist_user_items:\n",
-    "                b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]\n",
-    "                b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]\n",
-    "                \n",
-    "                sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))\n",
-    "                time_fea.append(abs(a_create_time-b_create_time))\n",
-    "                word_fea.append(abs(a_words_count-b_words_count))\n",
-    "                \n",
-    "            single_user_fea.extend(sim_fea)      # similarity features\n",
-    "            single_user_fea.extend(time_fea)    # time-difference features\n",
-    "            single_user_fea.extend(word_fea)    # word-count-difference features\n",
-    "            single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)])  # summary statistics of the similarities\n",
-    "            \n",
-    "            if user_emb:  # if user embeddings are given, add the similarity between the recalled article and the user\n",
-    "                single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))\n",
-    "                \n",
-    "            single_user_fea.extend([score, rank, label])    \n",
-    "            # Add the row to the overall table\n",
-    "            all_user_feas.append(single_user_fea)\n",
-    "    \n",
-    "    # Column names\n",
-    "    id_cols = ['user_id', 'click_article_id']\n",
-    "    sim_cols = ['sim' + str(i) for i in range(N)]\n",
-    "    time_cols = ['time_diff' + str(i) for i in range(N)]\n",
-    "    word_cols = ['word_diff' + str(i) for i in range(N)]\n",
-    "    stat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']\n",
-    "    user_item_sim_cols = ['user_item_sim'] if user_emb else []\n",
-    "    user_score_rank_label = ['score', 'rank', 'label']\n",
-    "    cols = id_cols + sim_cols + time_cols + word_cols + stat_cols + user_item_sim_cols + user_score_rank_label\n",
-    "            \n",
-    "    # Convert to a DataFrame\n",
-    "    df = pd.DataFrame( all_user_feas, columns=cols)\n",
-    "    \n",
-    "    return df"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 61,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T01:08:17.531694Z",
- "start_time": "2020-11-18T01:08:10.754702Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
- "article_info_df = get_article_info_df()\n",
-    "all_click = pd.concat([click_trn, click_tst])\n",
- "item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 62,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:06:22.709350Z",
- "start_time": "2020-11-18T01:08:39.923811Z"
- },
- "scrolled": true
- },
- "outputs": [
- {
- "name": "stderr",
- "output_type": "stream",
- "text": [
- "100%|██████████| 200000/200000 [50:16<00:00, 66.31it/s] \n",
- "100%|██████████| 50000/50000 [1:07:21<00:00, 12.37it/s]\n"
- ]
- }
- ],
- "source": [
-    "# Build features for the recalled articles in the training, validation and test data\n",
- "trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \\\n",
- " click_trn_hist, article_info_df, item_content_emb_dict)\n",
- "\n",
- "if val_user_item_label_tuples_dict is not None:\n",
- " val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \\\n",
- " click_val_hist, article_info_df, item_content_emb_dict)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \\\n",
- " click_tst_hist, article_info_df, item_content_emb_dict)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 63,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:13:58.573422Z",
- "start_time": "2020-11-18T03:13:40.157228Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Save a copy so this does not have to be recomputed each time; every run takes a long time\n",
- "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
- "\n",
- "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:14:22.838154Z",
- "start_time": "2020-11-18T03:14:22.828212Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
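-    "## User and article features\n",
-    "### User features\n",
-    "This is where feature engineering starts in earnest: we join the features we already have and also construct new ones. First, an overview of the existing features and the ones we can build:\n",
-    "1. Features of the article itself: word count, creation time and the article embedding (in the articles table).\n",
-    "2. Features of the click environment, i.e. the device features (these are in the click-log DataFrame).\n",
-    "3. Further features that can be constructed for users and articles:\n",
-    "    * User activity, based on how many articles a user clicks and when.\n",
-    "    * Article popularity, based on how many times an article is clicked and when.\n",
-    "    * User time statistics: statistics such as the mean over the click times and creation times of the articles in the user's click history, which reflect the user's preference for article freshness.\n",
-    "    * User topic preference: summarize the topics of the articles the user has clicked, then check whether the current article belongs to a topic the user has already clicked.\n",
-    "    * User word-count preference: the mean word count of the articles the user has clicked."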
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:16:37.637495Z",
- "start_time": "2020-11-14T03:16:37.618229Z"
- }
- },
- "outputs": [],
- "source": [
- "click_tst.head()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:09:11.675550Z",
- "start_time": "2020-11-17T02:09:10.265134Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Load the article features\n",
-    "articles = pd.read_csv(data_path+'articles.csv')\n",
-    "articles = reduce_mem(articles)\n",
-    "\n",
-    "# The log data: all of the click data loaded above\n",
-    "if click_val is not None:\n",
-    "    all_data = pd.concat([click_trn, click_val, click_tst])\n",
-    "else:\n",
-    "    all_data = pd.concat([click_trn, click_tst])\n",
-    "all_data = reduce_mem(all_data)\n",
-    "\n",
-    "# Join the article information\n",
-    "all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:17:12.256244Z",
- "start_time": "2020-11-14T03:17:12.250452Z"
- }
- },
- "outputs": [],
- "source": [
- "all_data.shape"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
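-    "#### Click times and click counts: measuring user activity\n",
-    "If a user clicks articles at short intervals and also clicks many articles, we treat that user as active. There are many possible ways to measure activity and this is only one of them. We write a function that produces an activity feature with the following logic:\n",
-    "1. Group by user_id and, for each user, compute the number of clicks and the mean interval between consecutive clicks.\n",
-    "2. Take the reciprocal of the click count, min-max normalize both it and the mean interval, and add the two; the smaller the sum, the more active the user.\n",
-    "3. A user with only one click has no interval, so the mean interval would be empty; such users get a fixed fill value instead (the stated intent is a large value that sets them apart, while the function below fills in 1).\n",
-    "\n",
-    "In short, a small value means the user clicks often and at short intervals."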
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:28:55.336058Z",
- "start_time": "2020-11-17T02:28:55.324332Z"
- }
- },
- "outputs": [],
- "source": [
-    "def active_level(all_data, cols):\n",
-    "    \"\"\"\n",
-    "    Build a feature that measures how active each user is\n",
-    "    :param all_data: the dataset\n",
-    "    :param cols: the columns to use\n",
-    "    \"\"\"\n",
-    "    data = all_data[cols].sort_values(['user_id', 'click_timestamp'])\n",
-    "    # Per user: number of clicks and the time-ordered list of click timestamps\n",
-    "    user_act = data.groupby('user_id', as_index=False).agg(\n",
-    "        click_size=('click_article_id', 'count'),\n",
-    "        click_timestamp=('click_timestamp', list))\n",
-    "    \n",
-    "    # Mean interval between consecutive clicks\n",
-    "    def time_diff_mean(l):\n",
-    "        if len(l) == 1:\n",
-    "            return 1\n",
-    "        else:\n",
-    "            return np.mean(np.diff(l))\n",
-    "        \n",
-    "    user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
-    "    \n",
-    "    # Reciprocal of the click count\n",
-    "    user_act['click_size'] = 1 / user_act['click_size']\n",
-    "    \n",
-    "    # Min-max normalize both, then add them\n",
-    "    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())\n",
-    "    user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min())     \n",
-    "    user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']\n",
-    "    \n",
-    "    user_act['user_id'] = user_act['user_id'].astype('int')\n",
-    "    del user_act['click_timestamp']\n",
-    "    \n",
-    "    return user_act"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:30:12.696060Z",
- "start_time": "2020-11-17T02:29:01.523837Z"
- }
- },
- "outputs": [],
- "source": [
- "user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:28:53.996742Z",
- "start_time": "2020-11-17T02:09:18.374Z"
- }
- },
- "outputs": [],
- "source": [
- "user_act_fea.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
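-    "#### Click times and click counts per article: measuring article popularity\n",
-    "The idea is the same as above: an article that is clicked many times within a short span is popular. The logic is nearly identical, except that we group by the clicked article:\n",
-    "1. Group by article and, for each article, compute the mean interval between consecutive clicks by its users.\n",
-    "2. Take the reciprocal of the user count, min-max normalize it and the mean interval, and add the two to get the popularity feature; the smaller the value, the more often and the more densely the article is clicked, i.e. the hotter it is.\n",
-    "\n",
-    "This is just one way to judge article popularity; feel free to brainstorm others."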
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:41:26.398567Z",
- "start_time": "2020-11-17T02:41:26.386668Z"
- }
- },
- "outputs": [],
- "source": [
-    "def hot_level(all_data, cols):\n",
-    "    \"\"\"\n",
-    "    Build a feature that measures how popular each article is\n",
-    "    :param all_data: the dataset\n",
-    "    :param cols: the columns to use\n",
-    "    \"\"\"\n",
-    "    data = all_data[cols].sort_values(['click_article_id', 'click_timestamp'])\n",
-    "    # Per article: number of clicks and the time-ordered list of click timestamps\n",
-    "    article_hot = data.groupby('click_article_id', as_index=False).agg(\n",
-    "        user_num=('user_id', 'count'),\n",
-    "        click_timestamp=('click_timestamp', list))\n",
-    "    \n",
-    "    # Mean interval between consecutive clicks on the article\n",
-    "    def time_diff_mean(l):\n",
-    "        if len(l) == 1:\n",
-    "            return 1\n",
-    "        else:\n",
-    "            return np.mean(np.diff(l))\n",
-    "        \n",
-    "    article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
-    "    \n",
-    "    # Reciprocal of the click count\n",
-    "    article_hot['user_num'] = 1 / article_hot['user_num']\n",
-    "    \n",
-    "    # Min-max normalize both, then add them\n",
-    "    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())\n",
-    "    article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min())     \n",
-    "    article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']\n",
-    "    \n",
-    "    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')\n",
-    "    \n",
-    "    del article_hot['click_timestamp']\n",
-    "    \n",
-    "    return article_hot"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T02:41:44.635900Z",
- "start_time": "2020-11-17T02:41:31.473032Z"
- }
- },
- "outputs": [],
- "source": [
- "article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:19:54.775290Z",
- "start_time": "2020-11-14T03:19:54.763699Z"
- }
- },
- "outputs": [],
- "source": [
- "article_hot_fea.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
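-    "#### User habits\n",
-    "From the click log we build a per-user DataFrame, analogous to the articles table, that holds user-specific information, mainly click habits and preferences:\n",
-    "* Device habits: take the most frequently used device (the mode).\n",
-    "* Time habits: summarize the timestamps of the articles the user has clicked. Ideally we would extract the hour from the timestamp to see at what time of day the user tends to read; for now we just take the mean of the normalized timestamps.\n",
-    "* Topic preferences: infer which topics the user leans toward from the topics of the clicked articles. A multi-hot encoding would be best; for now we try a simpler version.\n",
-    "* Word-count preferences: the typical length of the articles the user likes.\n",
-    "\n",
-    "All of these amount to grouping by user and computing statistics."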
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### User device habits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T04:22:48.877978Z",
- "start_time": "2020-11-17T04:22:48.872049Z"
- }
- },
- "outputs": [],
- "source": [
-    "def device_fea(all_data, cols):\n",
-    "    \"\"\"\n",
-    "    Build the user's device features\n",
-    "    :param all_data: the dataset\n",
-    "    :param cols: the columns to use\n",
-    "    \"\"\"\n",
-    "    user_device_info = all_data[cols]\n",
-    "    \n",
-    "    # Represent each user's device information by the mode of each column\n",
-    "    user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()\n",
-    "    \n",
-    "    return user_device_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T05:27:10.897473Z",
- "start_time": "2020-11-17T04:49:33.214865Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Device features (this takes a while)\n",
- "device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']\n",
- "user_device_info = device_fea(all_data, device_cols)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T04:20:39.765842Z",
- "start_time": "2020-11-14T04:20:39.747087Z"
- }
- },
- "outputs": [],
- "source": [
- "user_device_info.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### User time habits"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:11:50.889905Z",
- "start_time": "2020-11-17T06:11:50.882653Z"
- }
- },
- "outputs": [],
- "source": [
-    "def user_time_hob_fea(all_data, cols):\n",
-    "    \"\"\"\n",
-    "    Build the user's time-habit features\n",
-    "    :param all_data: the dataset\n",
-    "    :param cols: the columns to use\n",
-    "    \"\"\"\n",
-    "    # Copy so that the assignments below do not write into a slice of all_data\n",
-    "    user_time_hob_info = all_data[cols].copy()\n",
-    "    \n",
-    "    # Min-max normalize the timestamps first\n",
-    "    mm = MinMaxScaler()\n",
-    "    user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])\n",
-    "    user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])\n",
-    "\n",
-    "    user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()\n",
-    "    \n",
-    "    user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)\n",
-    "    return user_time_hob_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:31:51.646110Z",
- "start_time": "2020-11-17T06:31:51.171431Z"
- }
- },
- "outputs": [],
- "source": [
- "user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts']\n",
- "user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
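-    "#### User topic preferences\n",
-    "Here we first collect the topics of the articles each user has clicked into a list. Later, when everything is merged, we build a separate feature from it: 1 if the candidate article's topic is in this list, 0 otherwise."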
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:31:56.571088Z",
- "start_time": "2020-11-17T06:31:56.565304Z"
- }
- },
- "outputs": [],
- "source": [
-    "def user_cat_hob_fea(all_data, cols):\n",
-    "    \"\"\"\n",
-    "    The user's topic preferences\n",
-    "    :param all_data: the dataset\n",
-    "    :param cols: the columns to use\n",
-    "    \"\"\"\n",
-    "    user_category_hob_info = all_data[cols]\n",
-    "    # Collect each user's clicked category ids into one list per user\n",
-    "    user_category_hob_info = user_category_hob_info.groupby('user_id').agg(list).reset_index()\n",
-    "    \n",
-    "    user_cat_hob_info = pd.DataFrame()\n",
-    "    user_cat_hob_info['user_id'] = user_category_hob_info['user_id']\n",
-    "    user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']\n",
-    "    \n",
-    "    return user_cat_hob_info"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:32:55.150800Z",
- "start_time": "2020-11-17T06:32:00.740046Z"
- }
- },
- "outputs": [],
- "source": [
- "user_category_hob_cols = ['user_id', 'category_id']\n",
- "user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "#### User word-count preference"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:12.988460Z",
- "start_time": "2020-11-17T06:48:12.547000Z"
- }
- },
- "outputs": [],
- "source": [
- "user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
- "user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "#### Merging and saving the user features"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:18.289591Z",
- "start_time": "2020-11-17T06:48:17.084408Z"
- }
- },
- "outputs": [],
- "source": [
- "# Merge all the tables\n",
- "user_info = pd.merge(user_act_fea, user_device_info, on='user_id')\n",
- "user_info = user_info.merge(user_time_hob_info, on='user_id')\n",
- "user_info = user_info.merge(user_cat_hob_info, on='user_id')\n",
- "user_info = user_info.merge(user_wcou_info, on='user_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-17T06:48:26.907785Z",
- "start_time": "2020-11-17T06:48:21.457597Z"
- }
- },
- "outputs": [],
- "source": [
- "# Save the user features so they can be loaded directly later\n",
- "user_info.to_csv(save_path + 'user_info.csv', index=False) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
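- "### Loading the user features directly\n",
- "If the user feature engineering above has already been done, the saved features can simply be loaded from here on."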
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 69,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:15:49.502826Z",
- "start_time": "2020-11-18T03:15:48.062243Z"
- }
- },
- "outputs": [],
- "source": [
- "# Load the user information directly\n",
- "user_info = pd.read_csv(save_path + 'user_info.csv')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 70,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:15:56.899635Z",
- "start_time": "2020-11-18T03:15:53.701818Z"
- }
- },
- "outputs": [],
- "source": [
- "if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):\n",
- "    trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
- "    \n",
- "if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):\n",
- "    tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
- "\n",
- "if os.path.exists(save_path + 'val_user_item_feats_df.csv'):\n",
- "    val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
- "else:\n",
- "    val_user_item_feats_df = None"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 71,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:02.739197Z",
- "start_time": "2020-11-18T03:16:01.725028Z"
- }
- },
- "outputs": [],
- "source": [
- "# Join the user features\n",
- "# The following is for offline validation\n",
- "trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- "    val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
- "else:\n",
- "    val_user_item_feats_df = None\n",
- "    \n",
- "tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id',how='left')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 72,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:06.989877Z",
- "start_time": "2020-11-18T03:16:06.983327Z"
- }
- },
- "outputs": [
- {
- "data": {
- "text/plain": [
- "Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',\n",
- " 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',\n",
- " 'click_size', 'time_diff_mean', 'active_level', 'click_environment',\n",
- " 'click_deviceGroup', 'click_os', 'click_country', 'click_region',\n",
- " 'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',\n",
- " 'words_hbo'],\n",
- " dtype='object')"
- ]
- },
- "execution_count": 72,
- "metadata": {},
- "output_type": "execute_result"
- }
- ],
- "source": [
- "trn_user_item_feats_df.columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-14T03:13:36.071236Z",
- "start_time": "2020-11-14T03:13:36.050188Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Loading the article features directly"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 73,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:12.793070Z",
- "start_time": "2020-11-18T03:16:12.425380Z"
- }
- },
- "outputs": [
- {
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
- ]
- }
- ],
- "source": [
- "articles = pd.read_csv(data_path+'articles.csv')\n",
- "articles = reduce_mem(articles)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 74,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:16:18.118507Z",
- "start_time": "2020-11-18T03:16:16.344338Z"
- }
- },
- "outputs": [],
- "source": [
- "# Join the article features\n",
- "trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- "    val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
- "else:\n",
- "    val_user_item_feats_df = None\n",
- "\n",
- "tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Whether the recalled article's topic is among the user's preferences"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 76,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:17:40.251797Z",
- "start_time": "2020-11-18T03:16:28.130012Z"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
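+ "# Building features and labels: turning the task into a supervised learning problem\n",
+ "Let us first sort out which features of the raw data can be used directly:\n",
+ "1. Features of the article itself: category_id is the article's category; created_at_ts is the time the article was created, which relates to its timeliness; words_count is the article's word count. Readers generally tend not to click on very long articles, although some people do prefer long reads.\n",
+ "2. The article content embedding features. These were used during recall; here they are optional, and other kinds of embedding features, such as W2V, can be tried as well.\n",
+ "3. The user's device features.\n",
+ "\n",
+ "These directly usable features can be joined in by article_id or user_id once the feature engineering is done. Before that, however, we need to construct some features from the recall results and then build labels, forming a supervised learning dataset.  \n",
+ "The idea for constructing the supervised dataset: from the recall results we get a dictionary of the form {user_id: [list of articles the user may click]}. For each user and each candidate article we can then build one row of the supervised dataset. For example, if user1's recall list is {user1: [item1, item2, item3]}, we get three rows (user1, item1), (user1, item2), (user1, item3); these form the first two columns of the supervised dataset.  \n",
+ "\n",
+ "The idea for constructing features is as follows. The article a user clicks is strongly related to the articles they clicked before, for example by sharing a topic or being similar. So an important family of features **must draw on the user's historical clicks**. We already have a two-column dataset of users and their candidate articles, and our goal is to predict the last clicked article, so a natural approach is to relate each candidate to the user's last few clicks. This takes the click history into account while staying close to the last click, since timeliness matters a great deal for news: a user's last click is often closely related to the few clicks before it. For each candidate article we can therefore build the following features relative to the last few clicks:\n",
+ "1. Similarity between the candidate item and each of the last few clicks (embedding inner product) --- ties directly to the user's historical behavior\n",
+ "2. Statistics of those similarity features --- statistics dampen fluctuations and outliers\n",
+ "3. Word-count difference between the candidate item and each of the last few clicked articles --- reveals the user's preference through word count\n",
+ "4. Creation-time difference between the candidate item and each of the last few clicked articles --- reveals the user's preference for article freshness  \n",
+ "\n",
+ "\n",
+ "One more to consider:\n",
+ "**5. If YouTube recall was used, we can also build a similarity feature between the user and the candidate item**\n",
+ "\n",
+ "\n",
+ "\n",
+ "Of course, the above is only one way of doing feature engineering based on user history; feel free to brainstorm and try other features. Below we implement the features above, in the following order:\n",
+ "1. Obtain each user's last click and historical clicks from the click log dataset\n",
+ "2. Build features from the user's historical behavior, using the historical click table, the final recall list, the article info table, and the embedding vectors\n",
+ "3. Build labels to form the final supervised learning dataset"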
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:00.341709Z",
+ "start_time": "2020-11-17T09:06:58.723900Z"
+ },
+ "cell_style": "center",
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import pickle\n",
+ "from tqdm import tqdm\n",
+ "import gc, os\n",
+ "import logging\n",
+ "import time\n",
+ "import lightgbm as lgb\n",
+ "from gensim.models import Word2Vec\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Function to reduce DataFrame memory usage"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:02.411005Z",
+ "start_time": "2020-11-17T09:07:02.397830Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Helper that reduces a DataFrame's memory footprint\n",
+ "# by downcasting each numeric column to the smallest dtype that fits its value range\n",
+ "def reduce_mem(df):\n",
+ "    starttime = time.time()\n",
+ "    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']\n",
+ "    start_mem = df.memory_usage().sum() / 1024**2\n",
+ "    for col in df.columns:\n",
+ "        col_type = df[col].dtypes\n",
+ "        if col_type in numerics:\n",
+ "            c_min = df[col].min()\n",
+ "            c_max = df[col].max()\n",
+ "            if pd.isnull(c_min) or pd.isnull(c_max):\n",
+ "                continue\n",
+ "            if str(col_type)[:3] == 'int':\n",
+ "                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:\n",
+ "                    df[col] = df[col].astype(np.int8)\n",
+ "                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:\n",
+ "                    df[col] = df[col].astype(np.int16)\n",
+ "                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:\n",
+ "                    df[col] = df[col].astype(np.int32)\n",
+ "                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:\n",
+ "                    df[col] = df[col].astype(np.int64)\n",
+ "            else:\n",
+ "                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:\n",
+ "                    df[col] = df[col].astype(np.float16)\n",
+ "                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:\n",
+ "                    df[col] = df[col].astype(np.float32)\n",
+ "                else:\n",
+ "                    df[col] = df[col].astype(np.float64)\n",
+ "    end_mem = df.memory_usage().sum() / 1024**2\n",
+ "    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem,\n",
+ "                                                                                                           100*(start_mem-end_mem)/start_mem,\n",
+ "                                                                                                           (time.time()-starttime)/60))\n",
+ "    return df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:05.031436Z",
+ "start_time": "2020-11-17T09:07:05.026822Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
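+ "# Reading the data\n",
+ "\n",
+ "## Splitting the training and validation sets\n",
+ "\n",
+ "We split off a validation set so that model parameters can be evaluated offline. To mimic the test set as closely as possible, we sample a subset of users from the training set and use all of their records as the validation set. Splitting early also spreads out the cost of building ranking features, since building them for the whole dataset in one pass can take a long time."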
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:07.230308Z",
+ "start_time": "2020-11-17T09:07:07.221081Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# all_click_df is the training set\n",
+ "# sample_user_nums is the number of users sampled to form the validation set\n",
+ "def trn_val_split(all_click_df, sample_user_nums):\n",
+ "    all_click = all_click_df\n",
+ "    all_user_ids = all_click.user_id.unique()\n",
+ "    \n",
+ "    # replace=True would allow sampling with replacement; replace=False (used here) does not\n",
+ "    sample_user_ids = np.random.choice(all_user_ids, size=sample_user_nums, replace=False)\n",
+ "    \n",
+ "    click_val = all_click[all_click['user_id'].isin(sample_user_ids)]\n",
+ "    click_trn = all_click[~all_click['user_id'].isin(sample_user_ids)]\n",
+ "    \n",
+ "    # Take the last click of each validation user as the answer\n",
+ "    click_val = click_val.sort_values(['user_id', 'click_timestamp'])\n",
+ "    val_ans = click_val.groupby('user_id').tail(1)\n",
+ "    \n",
+ "    click_val = click_val.groupby('user_id').apply(lambda x: x[:-1]).reset_index(drop=True)\n",
+ "    \n",
+ "    # Drop users in val_ans that have only one click. If such a user's single click goes into the answer,\n",
+ "    # the validation history has no clicks for that user, which creates a user cold-start problem and complicates validation\n",
+ "    val_ans = val_ans[val_ans.user_id.isin(click_val.user_id.unique())]  # keep only answer users that still appear in the validation set\n",
+ "    click_val = click_val[click_val.user_id.isin(val_ans.user_id.unique())]\n",
+ "    \n",
+ "    return click_trn, click_val, val_ans"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Getting the historical clicks and the last click"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:19.202550Z",
+ "start_time": "2020-11-17T09:07:19.195766Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Get the historical clicks and the last click from the given data\n",
+ "def get_hist_and_last_click(all_click):\n",
+ "    all_click = all_click.sort_values(by=['user_id', 'click_timestamp'])\n",
+ "    click_last_df = all_click.groupby('user_id').tail(1)\n",
+ "\n",
+ "    # If a user has only one click, hist would be empty and the user would be invisible during training,\n",
+ "    # so in that case we deliberately leak the single click into the history\n",
+ "    def hist_func(user_df):\n",
+ "        if len(user_df) == 1:\n",
+ "            return user_df\n",
+ "        else:\n",
+ "            return user_df[:-1]\n",
+ "\n",
+ "    click_hist_df = all_click.groupby('user_id').apply(hist_func).reset_index(drop=True)\n",
+ "\n",
+ "    return click_hist_df, click_last_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading the training, validation, and test sets"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:21.181211Z",
+ "start_time": "2020-11-17T09:07:21.171338Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_trn_val_tst_data(data_path, offline=True):\n",
+ "    if offline:\n",
+ "        click_trn_data = pd.read_csv(data_path+'train_click_log.csv')  # training-set user click log\n",
+ "        click_trn_data = reduce_mem(click_trn_data)\n",
+ "        # NOTE: sample_user_nums is read from the global scope and must be defined before calling with offline=True\n",
+ "        click_trn, click_val, val_ans = trn_val_split(click_trn_data, sample_user_nums)\n",
+ "    else:\n",
+ "        click_trn = pd.read_csv(data_path+'train_click_log.csv')\n",
+ "        click_trn = reduce_mem(click_trn)\n",
+ "        click_val = None\n",
+ "        val_ans = None\n",
+ "    \n",
+ "    click_tst = pd.read_csv(data_path+'testA_click_log.csv')\n",
+ "    \n",
+ "    return click_trn, click_val, click_tst, val_ans"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading the recall lists"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:23.210604Z",
+ "start_time": "2020-11-17T09:07:23.203652Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Return either the merged multi-channel recall list or a single-channel recall list\n",
+ "def get_recall_list(save_path, single_recall_model=None, multi_recall=False):\n",
+ "    if multi_recall:\n",
+ "        return pickle.load(open(save_path + 'final_recall_items_dict.pkl', 'rb'))\n",
+ "    \n",
+ "    if single_recall_model == 'i2i_itemcf':\n",
+ "        return pickle.load(open(save_path + 'itemcf_recall_dict.pkl', 'rb'))\n",
+ "    elif single_recall_model == 'i2i_emb_itemcf':\n",
+ "        return pickle.load(open(save_path + 'itemcf_emb_dict.pkl', 'rb'))\n",
+ "    elif single_recall_model == 'user_cf':\n",
+ "        return pickle.load(open(save_path + 'youtubednn_usercf_dict.pkl', 'rb'))\n",
+ "    elif single_recall_model == 'youtubednn':\n",
+ "        return pickle.load(open(save_path + 'youtube_u2i_dict.pkl', 'rb'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading the various embeddings"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
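+ "##### Training Word2Vec and using gensim\n",
+ "\n",
+ "The main idea of Word2Vec is that a word's context expresses its meaning well, so word vectors can be learned in an unsupervised way. Word2vec has two classic models: skip-gram and CBOW.\n",
+ "\n",
+ "- skip-gram: predict the surrounding words given the center word.\n",
+ "- cbow: predict the center word given the surrounding words.\n",
+ "![image-20201106225233086](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)\n",
+ "\n",
+ "When training word2vec with gensim, a few parameters are particularly important (the names below follow the gensim 3.x API used here; gensim 4 renamed `size` to `vector_size` and `iter` to `epochs`):\n",
+ "- size: the dimensionality of the word vectors.\n",
+ "- window: the maximum distance at which context words are still related to the target word.\n",
+ "- sg: 0 selects the CBOW model, 1 selects the Skip-Gram model.\n",
+ "- workers: the number of worker threads used for training.\n",
+ "- min_count: the minimum word frequency; words that occur fewer times than this are ignored.\n",
+ "- iter: the number of passes over the whole dataset during training.\n",
+ "\n",
+ "**Notes**\n",
+ "1. The training corpus must be a two-dimensional list of string tokens, e.g. [['北', '京', '你', '好'], ['上', '海', '你', '好']]\n",
+ "2. The model has default values for its parameters; in Jupyter you can inspect them with `Word2Vec??`\n",
+ "\n",
+ "\n",
+ "Here is a simple test example:\n",
+ "```\n",
+ "from gensim.models import Word2Vec\n",
+ "docs = [['30760', '157507'],\n",
+ "        ['289197', '63746'],\n",
+ "        ['36162', '168401'],\n",
+ "        ['50644', '36162']]\n",
+ "w2v = Word2Vec(docs, size=12, sg=1, window=2, seed=2020, workers=2, min_count=1, iter=1)\n",
+ "\n",
+ "# Look up the word vector for '30760'\n",
+ "w2v.wv['30760']\n",
+ "```\n",
+ "\n",
+ "For the detailed principles of skip-gram and CBOW, see the following blog posts:\n",
+ "- [Word2vec principles (1): basics of the CBOW and Skip-Gram models](https://www.cnblogs.com/pinard/p/7160330.html) \n",
+ "- [Word2vec principles (2): the model based on Hierarchical Softmax](https://www.cnblogs.com/pinard/p/7160330.html) \n",
+ "- [Word2vec principles (3): the model based on Negative Sampling](https://www.cnblogs.com/pinard/p/7249903.html) "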
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:26.676173Z",
+ "start_time": "2020-11-17T09:07:26.667926Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def trian_item_word2vec(click_df, embed_size=64, save_name='item_w2v_emb.pkl', split_char=' '):\n",
+ "    click_df = click_df.sort_values('click_timestamp')\n",
+ "    # Article ids must be converted to strings before training\n",
+ "    click_df['click_article_id'] = click_df['click_article_id'].astype(str)\n",
+ "    # Convert each user's click sequence into a sentence\n",
+ "    docs = click_df.groupby(['user_id'])['click_article_id'].apply(lambda x: list(x)).reset_index()\n",
+ "    docs = docs['click_article_id'].values.tolist()\n",
+ "\n",
+ "    # Configure logging so that training progress is visible\n",
+ "    logging.basicConfig(format='%(asctime)s:%(levelname)s:%(message)s', level=logging.INFO)\n",
+ "\n",
+ "    # These parameters strongly affect the resulting vectors; negative sampling defaults to 5\n",
+ "    w2v = Word2Vec(docs, size=16, sg=1, window=5, seed=2020, workers=24, min_count=1, iter=1)\n",
+ "    \n",
+ "    # Save as a dictionary mapping article id to vector\n",
+ "    item_w2v_emb_dict = {k: w2v.wv[k] for k in click_df['click_article_id']}\n",
+ "    pickle.dump(item_w2v_emb_dict, open(save_path + 'item_w2v_emb.pkl', 'wb'))\n",
+ "    \n",
+ "    return item_w2v_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:27.285690Z",
+ "start_time": "2020-11-17T09:07:27.276646Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# The returned dictionaries can be used to look up each item's embedding\n",
+ "def get_embedding(save_path, all_click_df):\n",
+ "    if os.path.exists(save_path + 'item_content_emb.pkl'):\n",
+ "        item_content_emb_dict = pickle.load(open(save_path + 'item_content_emb.pkl', 'rb'))\n",
+ "    else:\n",
+ "        print('item_content_emb.pkl does not exist...')\n",
+ "    \n",
+ "    # The w2v embedding must be trained in advance; train it here if the file is missing\n",
+ "    if os.path.exists(save_path + 'item_w2v_emb.pkl'):\n",
+ "        item_w2v_emb_dict = pickle.load(open(save_path + 'item_w2v_emb.pkl', 'rb'))\n",
+ "    else:\n",
+ "        item_w2v_emb_dict = trian_item_word2vec(all_click_df)\n",
+ "    \n",
+ "    if os.path.exists(save_path + 'item_youtube_emb.pkl'):\n",
+ "        item_youtube_emb_dict = pickle.load(open(save_path + 'item_youtube_emb.pkl', 'rb'))\n",
+ "    else:\n",
+ "        print('item_youtube_emb.pkl does not exist...')\n",
+ "    \n",
+ "    if os.path.exists(save_path + 'user_youtube_emb.pkl'):\n",
+ "        user_youtube_emb_dict = pickle.load(open(save_path + 'user_youtube_emb.pkl', 'rb'))\n",
+ "    else:\n",
+ "        print('user_youtube_emb.pkl does not exist...')\n",
+ "    \n",
+ "    return item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading the article information"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:28.391797Z",
+ "start_time": "2020-11-17T09:07:28.386650Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_article_info_df():\n",
+ "    article_info_df = pd.read_csv(data_path + 'articles.csv')\n",
+ "    article_info_df = reduce_mem(article_info_df)\n",
+ "    \n",
+ "    return article_info_df"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Reading the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:07:32.362045Z",
+ "start_time": "2020-11-17T09:07:29.490413Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 23.34 Mb (69.4% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+ "# The only difference between offline and online here is whether the validation set is empty\n",
+ "click_trn, click_val, click_tst, val_ans = get_trn_val_tst_data(data_path, offline=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:10.378966Z",
+ "start_time": "2020-11-17T09:07:32.468580Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "click_trn_hist, click_trn_last = get_hist_and_last_click(click_trn)\n",
+ "\n",
+ "if click_val is not None:\n",
+ "    click_val_hist, click_val_last = click_val, val_ans\n",
+ "else:\n",
+ "    click_val_hist, click_val_last = None, None\n",
+ "    \n",
+ "click_tst_hist = click_tst"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
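+ "## Negative sampling on the training data\n",
+ "\n",
+ "Through recall, we convert the data into triples of the form (user1, item1, label). The positive and negative samples turn out to be extremely imbalanced, so we first downsample the negatives. Downsampling eases the class imbalance and also reduces the cost of building ranking features. What should we watch out for when doing negative sampling?\n",
+ "\n",
+ "1. Downsample only the negatives (if there is a good way to augment the positives, that is worth considering too)\n",
+ "2. After negative sampling, make sure every user and every article still appears in the sampled data\n",
+ "3. The downsampling ratio can be set by hand according to the actual situation\n",
+ "4. After negative sampling, update each user's recall list, because later features may use relative position information.\n",
+ "\n",
+ "Negative sampling could also be left until after the features are built; it is moved earlier here because building ranking features is very slow."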
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:36.096678Z",
+ "start_time": "2020-11-17T09:11:36.090911Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Convert the recall list into a DataFrame\n",
+ "def recall_dict_2_df(recall_list_dict):\n",
+ "    df_row_list = []  # [user, item, score]\n",
+ "    for user, recall_list in tqdm(recall_list_dict.items()):\n",
+ "        for item, score in recall_list:\n",
+ "            df_row_list.append([user, item, score])\n",
+ "    \n",
+ "    col_names = ['user_id', 'sim_item', 'score']\n",
+ "    recall_list_df = pd.DataFrame(df_row_list, columns=col_names)\n",
+ "    \n",
+ "    return recall_list_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:37.668844Z",
+ "start_time": "2020-11-17T09:11:37.659774Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Negative sampling function; the sampling rate is configurable, and a default value is given here\n",
+ "def neg_sample_recall_data(recall_items_df, sample_rate=0.001):\n",
+ "    pos_data = recall_items_df[recall_items_df['label'] == 1]\n",
+ "    neg_data = recall_items_df[recall_items_df['label'] == 0]\n",
+ "    \n",
+ "    print('pos_data_num:', len(pos_data), 'neg_data_num:', len(neg_data), 'pos/neg:', len(pos_data)/len(neg_data))\n",
+ "    \n",
+ "    # Per-group sampling function\n",
+ "    def neg_sample_func(group_df):\n",
+ "        neg_num = len(group_df)\n",
+ "        sample_num = max(int(neg_num * sample_rate), 1)  # keep at least one sample\n",
+ "        sample_num = min(sample_num, 5)  # keep at most 5 samples; adjust this to the actual situation\n",
+ "        return group_df.sample(n=sample_num, replace=True)\n",
+ "    \n",
+ "    # Sample negatives per user so that every user remains in the sampled data\n",
+ "    neg_data_user_sample = neg_data.groupby('user_id', group_keys=False).apply(neg_sample_func)\n",
+ "    # Sample negatives per article so that every article remains in the sampled data\n",
+ "    neg_data_item_sample = neg_data.groupby('sim_item', group_keys=False).apply(neg_sample_func)\n",
+ "    \n",
+ "    # Combine the samples from the two passes above\n",
+ "    neg_data_new = pd.concat([neg_data_user_sample, neg_data_item_sample])\n",
+ "    # The two passes are independent and may pick the same row twice, so deduplicate the combined data\n",
+ "    neg_data_new = neg_data_new.sort_values(['user_id', 'score']).drop_duplicates(['user_id', 'sim_item'], keep='last')\n",
+ "    \n",
+ "    # Add the positive samples back\n",
+ "    data_new = pd.concat([pos_data, neg_data_new], ignore_index=True)\n",
+ "    \n",
+ "    return data_new"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:39.481715Z",
+ "start_time": "2020-11-17T09:11:39.475144Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Label the recall data\n",
+ "def get_rank_label_df(recall_list_df, label_df, is_test=False):\n",
+ "    # The test set has no labels; to keep the later code uniform, use a negative number as a placeholder\n",
+ "    if is_test:\n",
+ "        recall_list_df['label'] = -1\n",
+ "        return recall_list_df\n",
+ "    \n",
+ "    label_df = label_df.rename(columns={'click_article_id': 'sim_item'})\n",
+ "    recall_list_df_ = recall_list_df.merge(label_df[['user_id', 'sim_item', 'click_timestamp']], \\\n",
+ "                                           how='left', on=['user_id', 'sim_item'])\n",
+ "    recall_list_df_['label'] = recall_list_df_['click_timestamp'].apply(lambda x: 0.0 if np.isnan(x) else 1.0)\n",
+ "    del recall_list_df_['click_timestamp']\n",
+ "    \n",
+ "    return recall_list_df_"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T09:11:41.555566Z",
+ "start_time": "2020-11-17T09:11:41.546766Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def get_user_recall_item_label_df(click_trn_hist, click_val_hist, click_tst_hist,click_trn_last, click_val_last, recall_list_df):\n",
+ "    # Get the recall list for the training users\n",
+ "    trn_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_trn_hist['user_id'].unique())]\n",
+ "    # Label the training data\n",
+ "    trn_user_item_label_df = get_rank_label_df(trn_user_items_df, click_trn_last, is_test=False)\n",
+ "    # Apply negative sampling to the training data\n",
+ "    trn_user_item_label_df = neg_sample_recall_data(trn_user_item_label_df)\n",
+ "    \n",
+ "    # Use the click_val_hist argument rather than the global click_val\n",
+ "    if click_val_hist is not None:\n",
+ "        val_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_val_hist['user_id'].unique())]\n",
+ "        val_user_item_label_df = get_rank_label_df(val_user_items_df, click_val_last, is_test=False)\n",
+ "        val_user_item_label_df = neg_sample_recall_data(val_user_item_label_df)\n",
+ "    else:\n",
+ "        val_user_item_label_df = None\n",
+ "    \n",
+ "    # The test data needs no negative sampling; simply label every recalled item with -1\n",
+ "    tst_user_items_df = recall_list_df[recall_list_df['user_id'].isin(click_tst_hist['user_id'].unique())]\n",
+ "    tst_user_item_label_df = get_rank_label_df(tst_user_items_df, None, is_test=True)\n",
+ "    \n",
+ "    return trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 56,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:23:35.357045Z",
+ "start_time": "2020-11-17T17:23:12.378284Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 250000/250000 [00:12<00:00, 20689.39it/s]\n"
+ ]
+ }
+ ],
+ "source": [
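+ "# Read the recall list\n",
+ "recall_list_dict = get_recall_list(save_path, single_recall_model='i2i_itemcf')  # only a single-channel recall result is used here; the multi-channel result can be used instead\n",
+ "# Convert the recall data into a DataFrame\n",
+ "recall_list_df = recall_dict_2_df(recall_list_dict)"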
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 57,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:29:04.598214Z",
+ "start_time": "2020-11-17T17:23:40.001052Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "pos_data_num: 64190 neg_data_num: 1935810 pos/neg: 0.03315924600038227\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Label the training and validation data and apply negative sampling (this step takes a while)\n",
+ "trn_user_item_label_df, val_user_item_label_df, tst_user_item_label_df = get_user_recall_item_label_df(click_trn_hist, \n",
+ "                                                                                                       click_val_hist, \n",
+ "                                                                                                       click_tst_hist,\n",
+ "                                                                                                       click_trn_last, \n",
+ "                                                                                                       click_val_last, \n",
+ "                                                                                                       recall_list_df)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:23:11.642944Z",
+ "start_time": "2020-11-17T17:23:08.475Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_label_df.label"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Converting the recall data into dictionaries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 58,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:36:22.800449Z",
+ "start_time": "2020-11-17T17:36:22.794670Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Convert the final recall DataFrame into dictionary form for building ranking features\n",
+ "def make_tuple_func(group_df):\n",
+ "    row_data = []\n",
+ "    for name, row_df in group_df.iterrows():\n",
+ "        row_data.append((row_df['sim_item'], row_df['score'], row_df['label']))\n",
+ "    \n",
+ "    return row_data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 59,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T17:40:05.991819Z",
+ "start_time": "2020-11-17T17:36:26.536429Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_label_tuples = trn_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ "trn_user_item_label_tuples_dict = dict(zip(trn_user_item_label_tuples['user_id'], trn_user_item_label_tuples[0]))\n",
+ "\n",
+ "if val_user_item_label_df is not None:\n",
+ "    val_user_item_label_tuples = val_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ "    val_user_item_label_tuples_dict = dict(zip(val_user_item_label_tuples['user_id'], val_user_item_label_tuples[0]))\n",
+ "else:\n",
+ "    val_user_item_label_tuples_dict = None\n",
+ "    \n",
+ "tst_user_item_label_tuples = tst_user_item_label_df.groupby('user_id').apply(make_tuple_func).reset_index()\n",
+ "tst_user_item_label_tuples_dict = dict(zip(tst_user_item_label_tuples['user_id'], tst_user_item_label_tuples[0]))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T07:59:53.141560Z",
+ "start_time": "2020-11-17T07:59:53.133599Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Feature engineering"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
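+ "## Building features related to the user's historical behavior\n",
+ "We build features for every item recalled for every user. The steps are:\n",
+ "* For each user, get the item_ids of the last N clicked items, \n",
+ "  * For each item recalled for that user, compute its similarity to each of those last N clicked items, the sum (and max, min, mean) of these similarities, the time-difference features, the word-count difference features, and its similarity to the user"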
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 60,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T01:07:47.268035Z",
+ "start_time": "2020-11-18T01:07:47.250449Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# 下面基于data做历史相关的特征\n",
+ "def create_feature(users_id, recall_list, click_hist_df, articles_info, articles_emb, user_emb=None, N=1):\n",
+ " \"\"\"\n",
+ " 基于用户的历史行为做相关特征\n",
+ " :param users_id: 用户id\n",
+ " :param recall_list: 对于每个用户召回的候选文章列表\n",
+ " :param click_hist_df: 用户的历史点击信息\n",
+ " :param articles_info: 文章信息\n",
+ " :param articles_emb: 文章的embedding向量, 这个可以用item_content_emb, item_w2v_emb, item_youtube_emb\n",
+ " :param user_emb: 用户的embedding向量, 这个是user_youtube_emb, 如果没有也可以不用, 但要注意如果要用的话, articles_emb就要用item_youtube_emb的形式, 这样维度才一样\n",
+ " :param N: 最近的N次点击 由于testA日志里面很多用户只存在一次历史点击, 所以为了不产生空值,默认是1\n",
+ " \"\"\"\n",
+ " \n",
+ " # 建立一个二维列表保存结果, 后面要转成DataFrame\n",
+ " all_user_feas = []\n",
+ " i = 0\n",
+ " for user_id in tqdm(users_id):\n",
+ " # 该用户的最后N次点击\n",
+ " hist_user_items = click_hist_df[click_hist_df['user_id']==user_id]['click_article_id'][-N:]\n",
+ " \n",
+ " # 遍历该用户的召回列表\n",
+ " for rank, (article_id, score, label) in enumerate(recall_list[user_id]):\n",
+ " # 该文章建立时间, 字数\n",
+ " a_create_time = articles_info[articles_info['article_id']==article_id]['created_at_ts'].values[0]\n",
+ " a_words_count = articles_info[articles_info['article_id']==article_id]['words_count'].values[0]\n",
+ " single_user_fea = [user_id, article_id]\n",
+ " # 计算与最后点击的商品的相似度的和, 最大值和最小值, 均值\n",
+ " sim_fea = []\n",
+ " time_fea = []\n",
+ " word_fea = []\n",
+ " # 遍历用户的最后N次点击文章\n",
+ " for hist_item in hist_user_items:\n",
+ " b_create_time = articles_info[articles_info['article_id']==hist_item]['created_at_ts'].values[0]\n",
+ " b_words_count = articles_info[articles_info['article_id']==hist_item]['words_count'].values[0]\n",
+ " \n",
+ " sim_fea.append(np.dot(articles_emb[hist_item], articles_emb[article_id]))\n",
+ " time_fea.append(abs(a_create_time-b_create_time))\n",
+ " word_fea.append(abs(a_words_count-b_words_count))\n",
+ " \n",
+ " single_user_fea.extend(sim_fea) # 相似性特征\n",
+ " single_user_fea.extend(time_fea) # 时间差特征\n",
+ " single_user_fea.extend(word_fea) # 字数差特征\n",
+ " single_user_fea.extend([max(sim_fea), min(sim_fea), sum(sim_fea), sum(sim_fea) / len(sim_fea)]) # 相似性的统计特征\n",
+ " \n",
+ " if user_emb: # 如果用户向量有的话, 这里计算该召回文章与用户的相似性特征 \n",
+ " single_user_fea.append(np.dot(user_emb[user_id], articles_emb[article_id]))\n",
+ " \n",
+ " single_user_fea.extend([score, rank, label]) \n",
+ " # 加入到总的表中\n",
+ " all_user_feas.append(single_user_fea)\n",
+ " \n",
+ " # 定义列名\n",
+ " id_cols = ['user_id', 'click_article_id']\n",
+ " sim_cols = ['sim' + str(i) for i in range(N)]\n",
+ " time_cols = ['time_diff' + str(i) for i in range(N)]\n",
+ " word_cols = ['word_diff' + str(i) for i in range(N)]\n",
+ " sat_cols = ['sim_max', 'sim_min', 'sim_sum', 'sim_mean']\n",
+ " user_item_sim_cols = ['user_item_sim'] if user_emb else []\n",
+ " user_score_rank_label = ['score', 'rank', 'label']\n",
+ " cols = id_cols + sim_cols + time_cols + word_cols + sat_cols + user_item_sim_cols + user_score_rank_label\n",
+ " \n",
+    "    # Convert to a DataFrame\n",
+ " df = pd.DataFrame( all_user_feas, columns=cols)\n",
+ " \n",
+ " return df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 61,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T01:08:17.531694Z",
+ "start_time": "2020-11-18T01:08:10.754702Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+ "article_info_df = get_article_info_df()\n",
+    "all_click = pd.concat([click_trn, click_tst])\n",
+ "item_content_emb_dict, item_w2v_emb_dict, item_youtube_emb_dict, user_youtube_emb_dict = get_embedding(save_path, all_click)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 62,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:06:22.709350Z",
+ "start_time": "2020-11-18T01:08:39.923811Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stderr",
+ "output_type": "stream",
+ "text": [
+ "100%|██████████| 200000/200000 [50:16<00:00, 66.31it/s] \n",
+ "100%|██████████| 50000/50000 [1:07:21<00:00, 12.37it/s]\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Build the features for the recalled articles in the training, validation and test data\n",
+ "trn_user_item_feats_df = create_feature(trn_user_item_label_tuples_dict.keys(), trn_user_item_label_tuples_dict, \\\n",
+ " click_trn_hist, article_info_df, item_content_emb_dict)\n",
+ "\n",
+ "if val_user_item_label_tuples_dict is not None:\n",
+ " val_user_item_feats_df = create_feature(val_user_item_label_tuples_dict.keys(), val_user_item_label_tuples_dict, \\\n",
+ " click_val_hist, article_info_df, item_content_emb_dict)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = create_feature(tst_user_item_label_tuples_dict.keys(), tst_user_item_label_tuples_dict, \\\n",
+ " click_tst_hist, article_info_df, item_content_emb_dict)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 63,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:13:58.573422Z",
+ "start_time": "2020-11-18T03:13:40.157228Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Save a copy so this slow step does not have to be rerun every time\n",
+ "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
+ "\n",
+ "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:14:22.838154Z",
+ "start_time": "2020-11-18T03:14:22.828212Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
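+    "## User and Article Features\n",
+    "### User-related features\n",
+    "This is where the feature engineering proper begins: we join the features we already have and construct additional ones. First, an overview of the existing features and the ones that can be built:\n",
+    "1. Features of the article itself: word count, creation time and the article embedding (from the articles table)\n",
+    "2. Click-environment features of the user, i.e. the device-related fields (these are in the click log df)\n",
+    "3. Features that can additionally be constructed for users and articles:\n",
+    "    * User activity features, built from how many articles a user clicked and when\n",
+    "    * Article popularity features, built from how often and when an article was clicked\n",
+    "    * User time statistics: aggregate (for example, average) the click times and the creation times of the articles in the user's click history, which reflects how much the user cares about article freshness\n",
+    "    * User topic preferences: collect the topics of the articles the user has clicked, then check whether the current article belongs to a topic the user has already clicked\n",
+    "    * User word-count preference: the mean word count of the articles the user has clicked"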
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:16:37.637495Z",
+ "start_time": "2020-11-14T03:16:37.618229Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "click_tst.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:09:11.675550Z",
+ "start_time": "2020-11-17T02:09:10.265134Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Load the article features\n",
+ "articles = pd.read_csv(data_path+'articles.csv')\n",
+ "articles = reduce_mem(articles)\n",
+ "\n",
+    "# Click log: all of the data loaded so far\n",
+    "all_data = pd.concat([click_trn, click_tst])\n",
+    "if click_val is not None:\n",
+    "    all_data = pd.concat([all_data, click_val])\n",
+ "all_data = reduce_mem(all_data)\n",
+ "\n",
+    "# Join the article information\n",
+ "all_data = all_data.merge(articles, left_on='click_article_id', right_on='article_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:17:12.256244Z",
+ "start_time": "2020-11-14T03:17:12.250452Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "all_data.shape"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
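+    "#### Analyze click times and click counts to gauge user activity\n",
+    "If the intervals between a user's clicks are short and the user clicks many articles, we treat that user as active. There are many ways to measure user activity; only one is shown here. The function below produces a feature that measures user activity, using the following logic:\n",
+    "1. Group by user_id and, for each user, compute the number of clicked articles and the mean interval between consecutive clicks\n",
+    "2. Take the reciprocal of the click count, min-max normalize both it and the mean interval, then add the two together; the smaller the value, the more active the user\n",
+    "3. Note that a user who clicked only once has no interval between clicks, so the mean interval is undefined; such users need a placeholder value that sets them apart (this text calls for a large number, whereas the function below actually returns 1)\n",
+    "\n",
+    "To summarize, the measure takes the reciprocal of the click count and normalizes it, normalizes the mean click interval, and adds the two; a smaller value means more clicks at shorter intervals."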
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:28:55.336058Z",
+ "start_time": "2020-11-17T02:28:55.324332Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def active_level(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build a feature that distinguishes user activity levels\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    data = all_data[cols].copy()  # copy so the in-place sort does not act on a slice of all_data\n",
+    "    data.sort_values(['user_id', 'click_timestamp'], inplace=True)\n",
+    "    user_act = pd.DataFrame(data.groupby('user_id', as_index=False)[['click_article_id', 'click_timestamp']].\\\n",
+    "                            agg({'click_article_id':np.size, 'click_timestamp': {list}}).values, columns=['user_id', 'click_size', 'click_timestamp'])\n",
+    "    \n",
+    "    # Mean interval between consecutive clicks\n",
+    "    def time_diff_mean(l):\n",
+    "        if len(l) == 1:\n",
+    "            return 1\n",
+    "        else:\n",
+    "            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
+    "        \n",
+    "    user_act['time_diff_mean'] = user_act['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
+    "    \n",
+    "    # Take the reciprocal of the click count\n",
+    "    user_act['click_size'] = 1 / user_act['click_size']\n",
+    "    \n",
+    "    # Min-max normalize both columns\n",
+    "    user_act['click_size'] = (user_act['click_size'] - user_act['click_size'].min()) / (user_act['click_size'].max() - user_act['click_size'].min())\n",
+    "    user_act['time_diff_mean'] = (user_act['time_diff_mean'] - user_act['time_diff_mean'].min()) / (user_act['time_diff_mean'].max() - user_act['time_diff_mean'].min())\n",
+    "    user_act['active_level'] = user_act['click_size'] + user_act['time_diff_mean']\n",
+    "    \n",
+    "    user_act['user_id'] = user_act['user_id'].astype('int')\n",
+    "    del user_act['click_timestamp']\n",
+    "    \n",
+    "    return user_act"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:30:12.696060Z",
+ "start_time": "2020-11-17T02:29:01.523837Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_act_fea = active_level(all_data, ['user_id', 'click_article_id', 'click_timestamp'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:28:53.996742Z",
+ "start_time": "2020-11-17T02:09:18.374Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_act_fea.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
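+    "#### Analyze click times and click counts to measure article popularity\n",
+    "The idea is the same as above: if an article is clicked many times within short intervals, it is popular. The implementation is essentially the same, except that the grouping is by clicked article:\n",
+    "1. Group by article and, for the users who clicked each article, compute the intervals between clicks\n",
+    "2. Take the reciprocal of the user count, min-max normalize both it and the mean interval, then add them to obtain the popularity feature; the smaller the value, the more clicks and the shorter the intervals, i.e. the hotter the article\n",
+    "\n",
+    "Of course, this is only one way to judge article popularity; feel free to brainstorm others"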
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:41:26.398567Z",
+ "start_time": "2020-11-17T02:41:26.386668Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def hot_level(all_data, cols):\n",
+    "    \"\"\"\n",
+    "    Build a feature that measures article popularity\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    data = all_data[cols].copy()  # copy so the in-place sort does not act on a slice of all_data\n",
+    "    data.sort_values(['click_article_id', 'click_timestamp'], inplace=True)\n",
+    "    article_hot = pd.DataFrame(data.groupby('click_article_id', as_index=False)[['user_id', 'click_timestamp']].\\\n",
+    "                               agg({'user_id':np.size, 'click_timestamp': {list}}).values, columns=['click_article_id', 'user_num', 'click_timestamp'])\n",
+    "    \n",
+    "    # Mean interval between consecutive clicks on the article\n",
+    "    def time_diff_mean(l):\n",
+    "        if len(l) == 1:\n",
+    "            return 1\n",
+    "        else:\n",
+    "            return np.mean([j-i for i, j in list(zip(l[:-1], l[1:]))])\n",
+    "        \n",
+    "    article_hot['time_diff_mean'] = article_hot['click_timestamp'].apply(lambda x: time_diff_mean(x))\n",
+    "    \n",
+    "    # Take the reciprocal of the click count\n",
+    "    article_hot['user_num'] = 1 / article_hot['user_num']\n",
+    "    \n",
+    "    # Min-max normalize both columns\n",
+    "    article_hot['user_num'] = (article_hot['user_num'] - article_hot['user_num'].min()) / (article_hot['user_num'].max() - article_hot['user_num'].min())\n",
+    "    article_hot['time_diff_mean'] = (article_hot['time_diff_mean'] - article_hot['time_diff_mean'].min()) / (article_hot['time_diff_mean'].max() - article_hot['time_diff_mean'].min())\n",
+    "    article_hot['hot_level'] = article_hot['user_num'] + article_hot['time_diff_mean']\n",
+    "    \n",
+    "    article_hot['click_article_id'] = article_hot['click_article_id'].astype('int')\n",
+    "    \n",
+    "    del article_hot['click_timestamp']\n",
+    "    \n",
+    "    return article_hot"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T02:41:44.635900Z",
+ "start_time": "2020-11-17T02:41:31.473032Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "article_hot_fea = hot_level(all_data, ['user_id', 'click_article_id', 'click_timestamp']) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:19:54.775290Z",
+ "start_time": "2020-11-14T03:19:54.763699Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "article_hot_fea.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
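+    "#### User habits\n",
+    "Based on the original click log, we build a DataFrame similar to the articles table that stores user-specific information, mainly click habits and preference features:\n",
+    "* Device habits: the most frequently used device (the mode)\n",
+    "* Time habits: statistics over the times of the articles the user has clicked (ideally the hour would be extracted from the timestamp to see at what time of day the user tends to click; for now the normalized timestamps are simply averaged)\n",
+    "* Topic preferences: determine which topics the user leans toward from the topics of the clicked articles (multi-hot encoding would be ideal; a simpler version is tried first)\n",
+    "* Word-count preference: the typical word count of the articles the user likes\n",
+    "\n",
+    "All of these are obtained by grouping by user and aggregating"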
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User device habits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T04:22:48.877978Z",
+ "start_time": "2020-11-17T04:22:48.872049Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def device_fea(all_data, cols):\n",
+ " \"\"\"\n",
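+    "    Build the user's device features\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    user_device_info = all_data[cols]\n",
+    "    \n",
+    "    # Represent each user's device information by the mode of each column\n",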
+ " user_device_info = user_device_info.groupby('user_id').agg(lambda x: x.value_counts().index[0]).reset_index()\n",
+ " \n",
+ " return user_device_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T05:27:10.897473Z",
+ "start_time": "2020-11-17T04:49:33.214865Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Device features (this step takes a while)\n",
+ "device_cols = ['user_id', 'click_environment', 'click_deviceGroup', 'click_os', 'click_country', 'click_region', 'click_referrer_type']\n",
+ "user_device_info = device_fea(all_data, device_cols)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T04:20:39.765842Z",
+ "start_time": "2020-11-14T04:20:39.747087Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_device_info.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User time habits"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:11:50.889905Z",
+ "start_time": "2020-11-17T06:11:50.882653Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def user_time_hob_fea(all_data, cols):\n",
+ " \"\"\"\n",
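+    "    Build the user's time-habit features\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+    "    \"\"\"\n",
+    "    user_time_hob_info = all_data[cols].copy()  # copy so the column assignments below do not act on a slice of all_data\n",
+    "    \n",
+    "    # First min-max normalize the timestamps\n",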
+ " mm = MinMaxScaler()\n",
+ " user_time_hob_info['click_timestamp'] = mm.fit_transform(user_time_hob_info[['click_timestamp']])\n",
+ " user_time_hob_info['created_at_ts'] = mm.fit_transform(user_time_hob_info[['created_at_ts']])\n",
+ "\n",
+ " user_time_hob_info = user_time_hob_info.groupby('user_id').agg('mean').reset_index()\n",
+ " \n",
+ " user_time_hob_info.rename(columns={'click_timestamp': 'user_time_hob1', 'created_at_ts': 'user_time_hob2'}, inplace=True)\n",
+ " return user_time_hob_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:31:51.646110Z",
+ "start_time": "2020-11-17T06:31:51.171431Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_time_hob_cols = ['user_id', 'click_timestamp', 'created_at_ts']\n",
+ "user_time_hob_info = user_time_hob_fea(all_data, user_time_hob_cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
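+    "#### User topic preferences\n",
+    "Here the topics of the articles a user has clicked are first collected into a list. Later, when everything is merged, a separate feature is built from it: 1 if the topic of the recalled article is in this list, 0 otherwise."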
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:31:56.571088Z",
+ "start_time": "2020-11-17T06:31:56.565304Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def user_cat_hob_fea(all_data, cols):\n",
+ " \"\"\"\n",
+    "    Build the user's topic preferences\n",
+    "    :param all_data: the dataset\n",
+    "    :param cols: the feature columns to use\n",
+ " \"\"\"\n",
+ " user_category_hob_info = all_data[cols]\n",
+ " user_category_hob_info = user_category_hob_info.groupby('user_id').agg({list}).reset_index()\n",
+ " \n",
+ " user_cat_hob_info = pd.DataFrame()\n",
+ " user_cat_hob_info['user_id'] = user_category_hob_info['user_id']\n",
+ " user_cat_hob_info['cate_list'] = user_category_hob_info['category_id']\n",
+ " \n",
+ " return user_cat_hob_info"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:32:55.150800Z",
+ "start_time": "2020-11-17T06:32:00.740046Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_category_hob_cols = ['user_id', 'category_id']\n",
+ "user_cat_hob_info = user_cat_hob_fea(all_data, user_category_hob_cols)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### User word-count preference feature"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:12.988460Z",
+ "start_time": "2020-11-17T06:48:12.547000Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "user_wcou_info = all_data.groupby('user_id')['words_count'].agg('mean').reset_index()\n",
+ "user_wcou_info.rename(columns={'words_count': 'words_hbo'}, inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "#### Merge and save the user features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:18.289591Z",
+ "start_time": "2020-11-17T06:48:17.084408Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Merge all of the tables\n",
+ "user_info = pd.merge(user_act_fea, user_device_info, on='user_id')\n",
+ "user_info = user_info.merge(user_time_hob_info, on='user_id')\n",
+ "user_info = user_info.merge(user_cat_hob_info, on='user_id')\n",
+ "user_info = user_info.merge(user_wcou_info, on='user_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-17T06:48:26.907785Z",
+ "start_time": "2020-11-17T06:48:21.457597Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Saved so the user features can be loaded directly from now on\n",
+ "user_info.to_csv(save_path + 'user_info.csv', index=False) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
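+    "### Load the user features directly\n",
+    "If the user feature engineering above has already been run, the saved features can simply be loaded from disk"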
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 69,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:15:49.502826Z",
+ "start_time": "2020-11-18T03:15:48.062243Z"
+ }
+ },
+ "outputs": [],
+ "source": [
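+    "import ast  # cate_list is stored in the CSV as text such as '[1, 2]' and must be parsed back into a list\n",
+    "user_info = pd.read_csv(save_path + 'user_info.csv', converters={'cate_list': ast.literal_eval})"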
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:15:56.899635Z",
+ "start_time": "2020-11-18T03:15:53.701818Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "if os.path.exists(save_path + 'trn_user_item_feats_df.csv'):\n",
+ " trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
+ " \n",
+ "if os.path.exists(save_path + 'tst_user_item_feats_df.csv'):\n",
+ " tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
+ "\n",
+ "if os.path.exists(save_path + 'val_user_item_feats_df.csv'):\n",
+ " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
+ "else:\n",
+ " val_user_item_feats_df = None"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:02.739197Z",
+ "start_time": "2020-11-18T03:16:01.725028Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Join the user features\n",
+    "# The validation branch below is only used for offline validation\n",
+ "trn_user_item_feats_df = trn_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df = val_user_item_feats_df.merge(user_info, on='user_id', how='left')\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = tst_user_item_feats_df.merge(user_info, on='user_id',how='left')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 72,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:06.989877Z",
+ "start_time": "2020-11-18T03:16:06.983327Z"
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "Index(['user_id', 'click_article_id', 'sim0', 'time_diff0', 'word_diff0',\n",
+ " 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score', 'rank', 'label',\n",
+ " 'click_size', 'time_diff_mean', 'active_level', 'click_environment',\n",
+ " 'click_deviceGroup', 'click_os', 'click_country', 'click_region',\n",
+ " 'click_referrer_type', 'user_time_hob1', 'user_time_hob2', 'cate_list',\n",
+ " 'words_hbo'],\n",
+ " dtype='object')"
+ ]
+ },
+ "execution_count": 72,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "trn_user_item_feats_df.columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-14T03:13:36.071236Z",
+ "start_time": "2020-11-14T03:13:36.050188Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Load the article features directly"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 73,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:12.793070Z",
+ "start_time": "2020-11-18T03:16:12.425380Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "-- Mem. usage decreased to 5.56 Mb (50.0% reduction),time spend:0.00 min\n"
+ ]
+ }
+ ],
+ "source": [
+ "articles = pd.read_csv(data_path+'articles.csv')\n",
+ "articles = reduce_mem(articles)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 74,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:16:18.118507Z",
+ "start_time": "2020-11-18T03:16:16.344338Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Join the article features\n",
+ "trn_user_item_feats_df = trn_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df = val_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ "\n",
+ "tst_user_item_feats_df = tst_user_item_feats_df.merge(articles, left_on='click_article_id', right_on='article_id')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Whether the recalled article's topic is among the user's preferences"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 76,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:17:40.251797Z",
+ "start_time": "2020-11-18T03:16:28.130012Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ "tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 77,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:19:30.451200Z",
+ "start_time": "2020-11-18T03:19:30.411225Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Offline validation\n",
+ "del trn_user_item_feats_df['cate_list']\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " del val_user_item_feats_df['cate_list']\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "del tst_user_item_feats_df['cate_list']\n",
+ "\n",
+ "del trn_user_item_feats_df['article_id']\n",
+ "\n",
+ "if val_user_item_feats_df is not None:\n",
+ " del val_user_item_feats_df['article_id']\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "del tst_user_item_feats_df['article_id']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Save the features"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 78,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T03:20:08.560942Z",
+ "start_time": "2020-11-18T03:19:35.601095Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+    "# Training, validation and test features\n",
+ "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
+ "if val_user_item_feats_df is not None:\n",
+ " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
+ "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
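+    "# Summary\n",
+    "Feature engineering and data cleaning/transformation are a crucial part of any competition, because **data and features determine the upper bound of machine learning, while algorithms and models only approximate that bound**. The quality of the feature engineering therefore often decides the final result. **Feature engineering** further strengthens the expressive power of the data: by constructing new features we can mine more information from it. In this section we first turned the prediction problem into a supervised learning problem by building features and labels, and then built a series of features around user profiles and article profiles. In addition, to keep positive and negative samples balanced, we covered negative sampling. This section only offers some ideas for constructing features; learners are encouraged to brainstorm and try other approaches, and discussion is welcome.\n",
+    "\n",
+    "**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision \"for the learner, growing together with learners\", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and taking responsibility. Datawhale also applies open-source principles to open content, open learning and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details:\n",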
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df['is_cat_hab'] = trn_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df['is_cat_hab'] = val_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- "tst_user_item_feats_df['is_cat_hab'] = tst_user_item_feats_df.apply(lambda x: 1 if x.category_id in set(x.cate_list) else 0, axis=1)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 77,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:19:30.451200Z",
- "start_time": "2020-11-18T03:19:30.411225Z"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.5"
+ },
+ "tianchi_metadata": {
+ "competitions": [],
+ "datasets": [],
+ "description": "",
+ "notebookId": "130010",
+ "source": "dsw"
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "218px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "outputs": [],
- "source": [
- "# 线下验证\n",
- "del trn_user_item_feats_df['cate_list']\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " del val_user_item_feats_df['cate_list']\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "del tst_user_item_feats_df['cate_list']\n",
- "\n",
- "del trn_user_item_feats_df['article_id']\n",
- "\n",
- "if val_user_item_feats_df is not None:\n",
- " del val_user_item_feats_df['article_id']\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "del tst_user_item_feats_df['article_id']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 保存特征"
- ]
},
- {
- "cell_type": "code",
- "execution_count": 78,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T03:20:08.560942Z",
- "start_time": "2020-11-18T03:19:35.601095Z"
- },
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "# 训练验证特征\n",
- "trn_user_item_feats_df.to_csv(save_path + 'trn_user_item_feats_df.csv', index=False)\n",
- "if val_user_item_feats_df is not None:\n",
- " val_user_item_feats_df.to_csv(save_path + 'val_user_item_feats_df.csv', index=False)\n",
- "tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=False)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 总结\n",
- "特征工程和数据清洗转换是比赛中至关重要的一块, 因为**数据和特征决定了机器学习的上限,而算法和模型只是逼近这个上限而已**,所以特征工程的好坏往往决定着最后的结果,**特征工程**可以一步增强数据的表达能力,通过构造新特征,我们可以挖掘出数据的更多信息,使得数据的表达能力进一步放大。 在本节内容中,我们主要是先通过制作特征和标签把预测问题转成了监督学习问题,然后围绕着用户画像和文章画像进行一系列特征的制作, 此外,为了保证正负样本的数据均衡,我们还学习了负采样就技术等。当然本节内容只是对构造特征提供了一些思路,也请学习者们在学习过程中开启头脑风暴,尝试更多的构造特征的方法,也欢迎我们一块探讨和交流。\n",
- "\n",
- "**关于Datawhale:** Datawhale是一个专注于数据科学与AI领域的开源组织,汇集了众多领域院校和知名企业的优秀学习者,聚合了一群有开源精神和探索精神的团队成员。Datawhale 以“for the learner,和学习者一起成长”为愿景,鼓励真实地展现自我、开放包容、互信互助、敢于试错和勇于担当。同时 Datawhale 用开源的理念去探索开源内容、开源学习和开源方案,赋能人才培养,助力人才成长,建立起人与人,人与知识,人与企业和人与未来的联结。 本次数据挖掘路径学习,专题知识将在天池分享,详情可关注Datawhale:\n",
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.5"
- },
- "tianchi_metadata": {
- "competitions": [],
- "datasets": [],
- "description": "",
- "notebookId": "130010",
- "source": "dsw"
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "218px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
- },
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
- }
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
index 5f96e246b..3af0aa71f 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/jupyter/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.ipynb"
@@ -1,2689 +1,2689 @@
{
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 排序模型\n",
- "通过召回的操作, 我们已经进行了问题规模的缩减, 对于每个用户, 选择出了N篇文章作为了候选集,并基于召回的候选集构建了与用户历史相关的特征,以及用户本身的属性特征,文章本省的属性特征,以及用户与文章之间的特征,下面就是使用机器学习模型来对构造好的特征进行学习,然后对测试集进行预测,得到测试集中的每个候选集用户点击的概率,返回点击概率最大的topk个文章,作为最终的结果。\n",
- "\n",
- "排序阶段选择了三个比较有代表性的排序模型,它们分别是:\n",
- "\n",
- "1. LGB的排序模型\n",
- "2. LGB的分类模型\n",
- "3. 深度学习的分类模型DIN\n",
- "\n",
- "得到了最终的排序模型输出的结果之后,还选择了两种比较经典的模型集成的方法:\n",
- "\n",
- "1. 输出结果加权融合\n",
- "2. Staking(将模型的输出结果再使用一个简单模型进行预测)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 1,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:39.770642Z",
- "start_time": "2020-11-18T04:20:38.500875Z"
- }
- },
- "outputs": [],
- "source": [
- "import numpy as np\n",
- "import pandas as pd\n",
- "import pickle\n",
- "from tqdm import tqdm\n",
- "import gc, os\n",
- "import time\n",
- "from datetime import datetime\n",
- "import lightgbm as lgb\n",
- "from sklearn.preprocessing import MinMaxScaler\n",
- "import warnings\n",
- "warnings.filterwarnings('ignore')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 读取排序特征"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 2,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:41.843180Z",
- "start_time": "2020-11-18T04:20:41.837287Z"
- }
- },
- "outputs": [],
- "source": [
- "data_path = './data_raw/'\n",
- "save_path = './temp_results/'\n",
- "offline = False"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 3,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:20:53.358138Z",
- "start_time": "2020-11-18T04:20:44.232944Z"
- }
- },
- "outputs": [],
- "source": [
- "# 重新读取数据的时候,发现click_article_id是一个浮点数,所以将其转换成int类型\n",
- "trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
- "trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
- " val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)\n",
- "else:\n",
- " val_user_item_feats_df = None\n",
- " \n",
- "tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
- "tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)\n",
- "\n",
- "# 做特征的时候为了方便,给测试集也打上了一个无效的标签,这里直接删掉就行\n",
- "del tst_user_item_feats_df['label']"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 返回排序后的结果"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 4,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:01.809368Z",
- "start_time": "2020-11-18T04:21:01.799641Z"
- }
- },
- "outputs": [],
- "source": [
- "def submit(recall_df, topk=5, model_name=None):\n",
- " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
- " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 判断是不是每个用户都有5篇文章及以上\n",
- " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
- " assert tmp.min() >= topk\n",
- " \n",
- " del recall_df['pred_score']\n",
- " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
- " \n",
- " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
- " # 按照提交格式定义列名\n",
- " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
- " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
- " \n",
- " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
- " submit.to_csv(save_name, index=False, header=True)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 5,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:04.332198Z",
- "start_time": "2020-11-18T04:21:04.325020Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序结果归一化\n",
- "def norm_sim(sim_df, weight=0.0):\n",
- " # print(sim_df.head())\n",
- " min_sim = sim_df.min()\n",
- " max_sim = sim_df.max()\n",
- " if max_sim == min_sim:\n",
- " sim_df = sim_df.apply(lambda sim: 1.0)\n",
- " else:\n",
- " sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))\n",
- "\n",
- " sim_df = sim_df.apply(lambda sim: sim + weight) # plus one\n",
- " return sim_df"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## LGB排序模型"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 6,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:07.787698Z",
- "start_time": "2020-11-18T04:21:07.536514Z"
- }
- },
- "outputs": [],
- "source": [
- "# 防止中间出错之后重新读取数据\n",
- "trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()\n",
- " \n",
- "tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 7,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:10.839656Z",
- "start_time": "2020-11-18T04:21:10.833109Z"
- }
- },
- "outputs": [],
- "source": [
- "# 定义特征列\n",
- "lgb_cols = ['sim0', 'time_diff0', 'word_diff0','sim_max', 'sim_min', 'sim_sum', \n",
- " 'sim_mean', 'score','click_size', 'time_diff_mean', 'active_level',\n",
- " 'click_environment','click_deviceGroup', 'click_os', 'click_country', \n",
- " 'click_region','click_referrer_type', 'user_time_hob1', 'user_time_hob2',\n",
- " 'words_hbo', 'category_id', 'created_at_ts','words_count']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 8,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:14.126608Z",
- "start_time": "2020-11-18T04:21:13.493653Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型分组\n",
- "trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
- "g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
- " g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 9,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:16.136151Z",
- "start_time": "2020-11-18T04:21:16.124444Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型定义\n",
- "lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 10,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:22.965433Z",
- "start_time": "2020-11-18T04:21:17.799127Z"
- }
- },
- "outputs": [],
- "source": [
- "# 排序模型训练\n",
- "if offline:\n",
- " lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,\n",
- " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
- " eval_group= [g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
- "else:\n",
- " lgb_ranker.fit(trn_user_item_feats_df[lgb_cols], trn_user_item_feats_df['label'], group=g_train)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 11,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:28.616665Z",
- "start_time": "2020-11-18T04:21:24.672280Z"
- }
- },
- "outputs": [],
- "source": [
- "# 模型预测\n",
- "tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
- "\n",
- "# 将这里的排序结果保存一份,用户后面的模型融合\n",
- "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 12,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:21:40.253692Z",
- "start_time": "2020-11-18T04:21:30.546587Z"
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
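+ "# Ranking Models\n",
+ "The recall stage has already reduced the size of the problem: for each user, N articles were selected as a candidate set. On top of these candidates we built features related to the user's click history, the user's own attributes, the article's own attributes, and user-article interaction features. The next step is to train machine learning models on these features and predict on the test set, which gives a click probability for every user-candidate pair; the top-k articles with the highest click probability are returned as the final result.\n",
+ "\n",
+ "The ranking stage uses three representative ranking models:\n",
+ "\n",
+ "1. LGB ranking model\n",
+ "2. LGB classification model\n",
+ "3. DIN, a deep learning classification model\n",
+ "\n",
+ "Once the ranking models have produced their outputs, two classic model ensembling methods are applied:\n",
+ "\n",
+ "1. Weighted fusion of the output scores\n",
+ "2. Stacking (feeding the models' outputs into a simple model for a further prediction)"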
+ ]
},
- "scrolled": true
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_ranker')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 13,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:26.195838Z",
- "start_time": "2020-11-18T04:21:46.115002Z"
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:39.770642Z",
+ "start_time": "2020-11-18T04:20:38.500875Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "import pickle\n",
+ "from tqdm import tqdm\n",
+ "import gc, os\n",
+ "import time\n",
+ "from datetime import datetime\n",
+ "import lightgbm as lgb\n",
+ "from sklearn.preprocessing import MinMaxScaler\n",
+ "import warnings\n",
+ "warnings.filterwarnings('ignore')"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[1]\tvalid_0's ndcg@1: 0.909975\tvalid_0's ndcg@2: 0.963068\tvalid_0's ndcg@3: 0.96533\tvalid_0's ndcg@4: 0.965729\tvalid_0's ndcg@5: 0.965864\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9143\tvalid_0's ndcg@2: 0.964711\tvalid_0's ndcg@3: 0.966961\tvalid_0's ndcg@4: 0.967338\tvalid_0's ndcg@5: 0.967483\n",
- "[3]\tvalid_0's ndcg@1: 0.9181\tvalid_0's ndcg@2: 0.966114\tvalid_0's ndcg@3: 0.968289\tvalid_0's ndcg@4: 0.968773\tvalid_0's ndcg@5: 0.96887\n",
- "[4]\tvalid_0's ndcg@1: 0.925575\tvalid_0's ndcg@2: 0.969093\tvalid_0's ndcg@3: 0.971193\tvalid_0's ndcg@4: 0.971603\tvalid_0's ndcg@5: 0.97169\n",
- "[5]\tvalid_0's ndcg@1: 0.9267\tvalid_0's ndcg@2: 0.969635\tvalid_0's ndcg@3: 0.97166\tvalid_0's ndcg@4: 0.972037\tvalid_0's ndcg@5: 0.972133\n",
- "[6]\tvalid_0's ndcg@1: 0.927\tvalid_0's ndcg@2: 0.969682\tvalid_0's ndcg@3: 0.971757\tvalid_0's ndcg@4: 0.972134\tvalid_0's ndcg@5: 0.972231\n",
- "[7]\tvalid_0's ndcg@1: 0.928825\tvalid_0's ndcg@2: 0.970451\tvalid_0's ndcg@3: 0.972476\tvalid_0's ndcg@4: 0.97282\tvalid_0's ndcg@5: 0.972927\n",
- "[8]\tvalid_0's ndcg@1: 0.930025\tvalid_0's ndcg@2: 0.970988\tvalid_0's ndcg@3: 0.972951\tvalid_0's ndcg@4: 0.973295\tvalid_0's ndcg@5: 0.973402\n",
- "[9]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971347\tvalid_0's ndcg@3: 0.973384\tvalid_0's ndcg@4: 0.973707\tvalid_0's ndcg@5: 0.973794\n",
- "[10]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.971385\tvalid_0's ndcg@3: 0.973372\tvalid_0's ndcg@4: 0.973717\tvalid_0's ndcg@5: 0.973794\n",
- "[11]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.971433\tvalid_0's ndcg@3: 0.973333\tvalid_0's ndcg@4: 0.973699\tvalid_0's ndcg@5: 0.973767\n",
- "[12]\tvalid_0's ndcg@1: 0.93145\tvalid_0's ndcg@2: 0.971656\tvalid_0's ndcg@3: 0.973493\tvalid_0's ndcg@4: 0.973881\tvalid_0's ndcg@5: 0.973949\n",
- "[13]\tvalid_0's ndcg@1: 0.932525\tvalid_0's ndcg@2: 0.971927\tvalid_0's ndcg@3: 0.973839\tvalid_0's ndcg@4: 0.974227\tvalid_0's ndcg@5: 0.974304\n",
- "[14]\tvalid_0's ndcg@1: 0.932575\tvalid_0's ndcg@2: 0.971898\tvalid_0's ndcg@3: 0.973823\tvalid_0's ndcg@4: 0.974243\tvalid_0's ndcg@5: 0.97432\n",
- "[15]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972239\tvalid_0's ndcg@3: 0.974189\tvalid_0's ndcg@4: 0.974587\tvalid_0's ndcg@5: 0.974665\n",
- "[16]\tvalid_0's ndcg@1: 0.933475\tvalid_0's ndcg@2: 0.972309\tvalid_0's ndcg@3: 0.974209\tvalid_0's ndcg@4: 0.974596\tvalid_0's ndcg@5: 0.974674\n",
- "[17]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972369\tvalid_0's ndcg@3: 0.974307\tvalid_0's ndcg@4: 0.974684\tvalid_0's ndcg@5: 0.974761\n",
- "[18]\tvalid_0's ndcg@1: 0.9339\tvalid_0's ndcg@2: 0.972497\tvalid_0's ndcg@3: 0.974372\tvalid_0's ndcg@4: 0.974749\tvalid_0's ndcg@5: 0.974836\n",
- "[19]\tvalid_0's ndcg@1: 0.9345\tvalid_0's ndcg@2: 0.972845\tvalid_0's ndcg@3: 0.974645\tvalid_0's ndcg@4: 0.974979\tvalid_0's ndcg@5: 0.975085\n",
- "[20]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.97484\tvalid_0's ndcg@4: 0.975174\tvalid_0's ndcg@5: 0.975271\n",
- "[21]\tvalid_0's ndcg@1: 0.935\tvalid_0's ndcg@2: 0.973092\tvalid_0's ndcg@3: 0.97488\tvalid_0's ndcg@4: 0.975192\tvalid_0's ndcg@5: 0.975289\n",
- "[22]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.974988\tvalid_0's ndcg@4: 0.975289\tvalid_0's ndcg@5: 0.975386\n",
- "[23]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974824\tvalid_0's ndcg@4: 0.975136\tvalid_0's ndcg@5: 0.975223\n",
- "[24]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973274\tvalid_0's ndcg@3: 0.975087\tvalid_0's ndcg@4: 0.975388\tvalid_0's ndcg@5: 0.975475\n",
- "[25]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973345\tvalid_0's ndcg@3: 0.97512\tvalid_0's ndcg@4: 0.975443\tvalid_0's ndcg@5: 0.97553\n",
- "[26]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.975\tvalid_0's ndcg@4: 0.975313\tvalid_0's ndcg@5: 0.9754\n",
- "[27]\tvalid_0's ndcg@1: 0.935175\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.974983\tvalid_0's ndcg@4: 0.975295\tvalid_0's ndcg@5: 0.975382\n",
- "[28]\tvalid_0's ndcg@1: 0.935425\tvalid_0's ndcg@2: 0.973328\tvalid_0's ndcg@3: 0.975041\tvalid_0's ndcg@4: 0.975374\tvalid_0's ndcg@5: 0.975471\n",
- "[29]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973225\tvalid_0's ndcg@3: 0.974963\tvalid_0's ndcg@4: 0.975297\tvalid_0's ndcg@5: 0.975403\n",
- "[30]\tvalid_0's ndcg@1: 0.9353\tvalid_0's ndcg@2: 0.973235\tvalid_0's ndcg@3: 0.97501\tvalid_0's ndcg@4: 0.975311\tvalid_0's ndcg@5: 0.975418\n",
- "[31]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973361\tvalid_0's ndcg@3: 0.975099\tvalid_0's ndcg@4: 0.975422\tvalid_0's ndcg@5: 0.975528\n",
- "[32]\tvalid_0's ndcg@1: 0.9364\tvalid_0's ndcg@2: 0.973641\tvalid_0's ndcg@3: 0.975391\tvalid_0's ndcg@4: 0.975714\tvalid_0's ndcg@5: 0.97582\n",
- "[33]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973751\tvalid_0's ndcg@3: 0.975501\tvalid_0's ndcg@4: 0.975824\tvalid_0's ndcg@5: 0.975931\n",
- "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.973902\tvalid_0's ndcg@3: 0.975677\tvalid_0's ndcg@4: 0.975989\tvalid_0's ndcg@5: 0.976095\n",
- "[35]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974105\tvalid_0's ndcg@3: 0.975892\tvalid_0's ndcg@4: 0.976194\tvalid_0's ndcg@5: 0.9763\n",
- "[36]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974184\tvalid_0's ndcg@3: 0.975984\tvalid_0's ndcg@4: 0.976296\tvalid_0's ndcg@5: 0.976402\n",
- "[37]\tvalid_0's ndcg@1: 0.93845\tvalid_0's ndcg@2: 0.974366\tvalid_0's ndcg@3: 0.976166\tvalid_0's ndcg@4: 0.976467\tvalid_0's ndcg@5: 0.976574\n",
- "[38]\tvalid_0's ndcg@1: 0.938925\tvalid_0's ndcg@2: 0.974557\tvalid_0's ndcg@3: 0.976332\tvalid_0's ndcg@4: 0.976655\tvalid_0's ndcg@5: 0.976751\n",
- "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974471\tvalid_0's ndcg@3: 0.976234\tvalid_0's ndcg@4: 0.976557\tvalid_0's ndcg@5: 0.976653\n",
- "[40]\tvalid_0's ndcg@1: 0.938325\tvalid_0's ndcg@2: 0.974335\tvalid_0's ndcg@3: 0.97611\tvalid_0's ndcg@4: 0.976433\tvalid_0's ndcg@5: 0.97653\n",
- "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.974669\tvalid_0's ndcg@3: 0.976431\tvalid_0's ndcg@4: 0.976743\tvalid_0's ndcg@5: 0.97683\n",
- "[42]\tvalid_0's ndcg@1: 0.939375\tvalid_0's ndcg@2: 0.974833\tvalid_0's ndcg@3: 0.976546\tvalid_0's ndcg@4: 0.976858\tvalid_0's ndcg@5: 0.976945\n",
- "[43]\tvalid_0's ndcg@1: 0.939625\tvalid_0's ndcg@2: 0.974878\tvalid_0's ndcg@3: 0.976628\tvalid_0's ndcg@4: 0.97694\tvalid_0's ndcg@5: 0.977027\n",
- "[44]\tvalid_0's ndcg@1: 0.9395\tvalid_0's ndcg@2: 0.974832\tvalid_0's ndcg@3: 0.97657\tvalid_0's ndcg@4: 0.976893\tvalid_0's ndcg@5: 0.97698\n",
- "[45]\tvalid_0's ndcg@1: 0.939775\tvalid_0's ndcg@2: 0.974949\tvalid_0's ndcg@3: 0.976674\tvalid_0's ndcg@4: 0.976997\tvalid_0's ndcg@5: 0.977084\n",
- "[46]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.974945\tvalid_0's ndcg@3: 0.976708\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977107\n",
- "[47]\tvalid_0's ndcg@1: 0.94005\tvalid_0's ndcg@2: 0.975004\tvalid_0's ndcg@3: 0.976766\tvalid_0's ndcg@4: 0.977078\tvalid_0's ndcg@5: 0.977175\n",
- "[48]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
- "[49]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
- "[50]\tvalid_0's ndcg@1: 0.9405\tvalid_0's ndcg@2: 0.975264\tvalid_0's ndcg@3: 0.976989\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
- "[51]\tvalid_0's ndcg@1: 0.941125\tvalid_0's ndcg@2: 0.975526\tvalid_0's ndcg@3: 0.977226\tvalid_0's ndcg@4: 0.977528\tvalid_0's ndcg@5: 0.977605\n",
- "[52]\tvalid_0's ndcg@1: 0.941\tvalid_0's ndcg@2: 0.97548\tvalid_0's ndcg@3: 0.977193\tvalid_0's ndcg@4: 0.977484\tvalid_0's ndcg@5: 0.977561\n",
- "[53]\tvalid_0's ndcg@1: 0.9411\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.977259\tvalid_0's ndcg@4: 0.977539\tvalid_0's ndcg@5: 0.977616\n",
- "[54]\tvalid_0's ndcg@1: 0.9412\tvalid_0's ndcg@2: 0.975712\tvalid_0's ndcg@3: 0.977299\tvalid_0's ndcg@4: 0.97759\tvalid_0's ndcg@5: 0.977667\n",
- "[55]\tvalid_0's ndcg@1: 0.94155\tvalid_0's ndcg@2: 0.975841\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977719\tvalid_0's ndcg@5: 0.977797\n",
- "[56]\tvalid_0's ndcg@1: 0.941825\tvalid_0's ndcg@2: 0.975943\tvalid_0's ndcg@3: 0.97753\tvalid_0's ndcg@4: 0.977821\tvalid_0's ndcg@5: 0.977898\n",
- "[57]\tvalid_0's ndcg@1: 0.9416\tvalid_0's ndcg@2: 0.975891\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977741\tvalid_0's ndcg@5: 0.977818\n",
- "[58]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977494\tvalid_0's ndcg@4: 0.977795\tvalid_0's ndcg@5: 0.977873\n",
- "[59]\tvalid_0's ndcg@1: 0.942025\tvalid_0's ndcg@2: 0.975985\tvalid_0's ndcg@3: 0.977547\tvalid_0's ndcg@4: 0.977881\tvalid_0's ndcg@5: 0.977958\n",
- "[60]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975994\tvalid_0's ndcg@3: 0.977569\tvalid_0's ndcg@4: 0.977892\tvalid_0's ndcg@5: 0.977969\n",
- "[61]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977559\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.97796\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Load the ranking features"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[62]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976027\tvalid_0's ndcg@3: 0.97764\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.978028\n",
- "[63]\tvalid_0's ndcg@1: 0.942125\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977622\tvalid_0's ndcg@4: 0.977912\tvalid_0's ndcg@5: 0.977999\n",
- "[64]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977793\tvalid_0's ndcg@4: 0.978105\tvalid_0's ndcg@5: 0.978192\n",
- "[65]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976227\tvalid_0's ndcg@3: 0.977802\tvalid_0's ndcg@4: 0.978125\tvalid_0's ndcg@5: 0.978212\n",
- "[66]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976132\tvalid_0's ndcg@3: 0.977695\tvalid_0's ndcg@4: 0.978018\tvalid_0's ndcg@5: 0.978105\n",
- "[67]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976092\tvalid_0's ndcg@3: 0.977679\tvalid_0's ndcg@4: 0.978002\tvalid_0's ndcg@5: 0.978089\n",
- "[68]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976148\tvalid_0's ndcg@3: 0.977698\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.978108\n",
- "[69]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976123\tvalid_0's ndcg@3: 0.977686\tvalid_0's ndcg@4: 0.978009\tvalid_0's ndcg@5: 0.978096\n",
- "[70]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976222\tvalid_0's ndcg@3: 0.977785\tvalid_0's ndcg@4: 0.978097\tvalid_0's ndcg@5: 0.978184\n",
- "[71]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976188\tvalid_0's ndcg@3: 0.977763\tvalid_0's ndcg@4: 0.978075\tvalid_0's ndcg@5: 0.978162\n",
- "[72]\tvalid_0's ndcg@1: 0.9427\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977809\tvalid_0's ndcg@4: 0.978121\tvalid_0's ndcg@5: 0.978208\n",
- "[73]\tvalid_0's ndcg@1: 0.9428\tvalid_0's ndcg@2: 0.976255\tvalid_0's ndcg@3: 0.977843\tvalid_0's ndcg@4: 0.978155\tvalid_0's ndcg@5: 0.978242\n",
- "[74]\tvalid_0's ndcg@1: 0.94295\tvalid_0's ndcg@2: 0.97631\tvalid_0's ndcg@3: 0.977898\tvalid_0's ndcg@4: 0.97821\tvalid_0's ndcg@5: 0.978297\n",
- "[75]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976329\tvalid_0's ndcg@3: 0.977941\tvalid_0's ndcg@4: 0.978232\tvalid_0's ndcg@5: 0.978319\n",
- "[76]\tvalid_0's ndcg@1: 0.9433\tvalid_0's ndcg@2: 0.976471\tvalid_0's ndcg@3: 0.978059\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978437\n",
- "[77]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976416\tvalid_0's ndcg@3: 0.977991\tvalid_0's ndcg@4: 0.978314\tvalid_0's ndcg@5: 0.978381\n",
- "[78]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976657\tvalid_0's ndcg@3: 0.978194\tvalid_0's ndcg@4: 0.978517\tvalid_0's ndcg@5: 0.978585\n",
- "[79]\tvalid_0's ndcg@1: 0.94365\tvalid_0's ndcg@2: 0.976663\tvalid_0's ndcg@3: 0.978188\tvalid_0's ndcg@4: 0.978501\tvalid_0's ndcg@5: 0.978578\n",
- "[80]\tvalid_0's ndcg@1: 0.943725\tvalid_0's ndcg@2: 0.976628\tvalid_0's ndcg@3: 0.978203\tvalid_0's ndcg@4: 0.978515\tvalid_0's ndcg@5: 0.978593\n",
- "[81]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97672\tvalid_0's ndcg@3: 0.978295\tvalid_0's ndcg@4: 0.978607\tvalid_0's ndcg@5: 0.978685\n",
- "[82]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978397\tvalid_0's ndcg@4: 0.97872\tvalid_0's ndcg@5: 0.978787\n",
- "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976788\tvalid_0's ndcg@3: 0.978375\tvalid_0's ndcg@4: 0.978698\tvalid_0's ndcg@5: 0.978766\n",
- "[84]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.97679\tvalid_0's ndcg@3: 0.97839\tvalid_0's ndcg@4: 0.978702\tvalid_0's ndcg@5: 0.97878\n",
- "[85]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.976809\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978723\tvalid_0's ndcg@5: 0.9788\n",
- "[86]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976939\tvalid_0's ndcg@3: 0.978502\tvalid_0's ndcg@4: 0.978814\tvalid_0's ndcg@5: 0.978891\n",
- "[87]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.976976\tvalid_0's ndcg@3: 0.978551\tvalid_0's ndcg@4: 0.978852\tvalid_0's ndcg@5: 0.97893\n",
- "[88]\tvalid_0's ndcg@1: 0.944925\tvalid_0's ndcg@2: 0.977102\tvalid_0's ndcg@3: 0.978677\tvalid_0's ndcg@4: 0.978968\tvalid_0's ndcg@5: 0.979045\n",
- "[89]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978758\tvalid_0's ndcg@4: 0.979048\tvalid_0's ndcg@5: 0.979126\n",
- "[90]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.978735\tvalid_0's ndcg@4: 0.979026\tvalid_0's ndcg@5: 0.979104\n",
- "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978858\tvalid_0's ndcg@4: 0.979138\tvalid_0's ndcg@5: 0.979215\n",
- "[92]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.977267\tvalid_0's ndcg@3: 0.978905\tvalid_0's ndcg@4: 0.979174\tvalid_0's ndcg@5: 0.979251\n",
- "[93]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977193\tvalid_0's ndcg@3: 0.978818\tvalid_0's ndcg@4: 0.979098\tvalid_0's ndcg@5: 0.979176\n",
- "[94]\tvalid_0's ndcg@1: 0.94545\tvalid_0's ndcg@2: 0.97728\tvalid_0's ndcg@3: 0.97888\tvalid_0's ndcg@4: 0.97916\tvalid_0's ndcg@5: 0.979238\n",
- "[95]\tvalid_0's ndcg@1: 0.9458\tvalid_0's ndcg@2: 0.977394\tvalid_0's ndcg@3: 0.979006\tvalid_0's ndcg@4: 0.979286\tvalid_0's ndcg@5: 0.979364\n",
- "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979114\tvalid_0's ndcg@4: 0.979394\tvalid_0's ndcg@5: 0.979472\n",
- "[97]\tvalid_0's ndcg@1: 0.946475\tvalid_0's ndcg@2: 0.977659\tvalid_0's ndcg@3: 0.979259\tvalid_0's ndcg@4: 0.979539\tvalid_0's ndcg@5: 0.979616\n",
- "[98]\tvalid_0's ndcg@1: 0.94675\tvalid_0's ndcg@2: 0.97776\tvalid_0's ndcg@3: 0.97936\tvalid_0's ndcg@4: 0.979651\tvalid_0's ndcg@5: 0.979719\n",
- "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
- "[100]\tvalid_0's ndcg@1: 0.9468\tvalid_0's ndcg@2: 0.977794\tvalid_0's ndcg@3: 0.979369\tvalid_0's ndcg@4: 0.979671\tvalid_0's ndcg@5: 0.979739\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
- "[1]\tvalid_0's ndcg@1: 0.909075\tvalid_0's ndcg@2: 0.963019\tvalid_0's ndcg@3: 0.965069\tvalid_0's ndcg@4: 0.965543\tvalid_0's ndcg@5: 0.965601\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9123\tvalid_0's ndcg@2: 0.964273\tvalid_0's ndcg@3: 0.966248\tvalid_0's ndcg@4: 0.966722\tvalid_0's ndcg@5: 0.966789\n",
- "[3]\tvalid_0's ndcg@1: 0.915075\tvalid_0's ndcg@2: 0.965691\tvalid_0's ndcg@3: 0.967466\tvalid_0's ndcg@4: 0.967854\tvalid_0's ndcg@5: 0.967922\n",
- "[4]\tvalid_0's ndcg@1: 0.91845\tvalid_0's ndcg@2: 0.967047\tvalid_0's ndcg@3: 0.968735\tvalid_0's ndcg@4: 0.969133\tvalid_0's ndcg@5: 0.969201\n",
- "[5]\tvalid_0's ndcg@1: 0.92355\tvalid_0's ndcg@2: 0.968961\tvalid_0's ndcg@3: 0.970674\tvalid_0's ndcg@4: 0.97104\tvalid_0's ndcg@5: 0.971098\n",
- "[6]\tvalid_0's ndcg@1: 0.9253\tvalid_0's ndcg@2: 0.969607\tvalid_0's ndcg@3: 0.971345\tvalid_0's ndcg@4: 0.971689\tvalid_0's ndcg@5: 0.971747\n",
- "[7]\tvalid_0's ndcg@1: 0.926225\tvalid_0's ndcg@2: 0.969933\tvalid_0's ndcg@3: 0.971708\tvalid_0's ndcg@4: 0.972031\tvalid_0's ndcg@5: 0.972079\n",
- "[8]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.970104\tvalid_0's ndcg@3: 0.971804\tvalid_0's ndcg@4: 0.972116\tvalid_0's ndcg@5: 0.972184\n",
- "[9]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970682\tvalid_0's ndcg@3: 0.972307\tvalid_0's ndcg@4: 0.972598\tvalid_0's ndcg@5: 0.972675\n",
- "[10]\tvalid_0's ndcg@1: 0.92775\tvalid_0's ndcg@2: 0.970653\tvalid_0's ndcg@3: 0.972316\tvalid_0's ndcg@4: 0.972617\tvalid_0's ndcg@5: 0.972685\n",
- "[11]\tvalid_0's ndcg@1: 0.9283\tvalid_0's ndcg@2: 0.97084\tvalid_0's ndcg@3: 0.97254\tvalid_0's ndcg@4: 0.97281\tvalid_0's ndcg@5: 0.972887\n",
- "[12]\tvalid_0's ndcg@1: 0.9287\tvalid_0's ndcg@2: 0.971051\tvalid_0's ndcg@3: 0.972701\tvalid_0's ndcg@4: 0.97297\tvalid_0's ndcg@5: 0.973048\n",
- "[13]\tvalid_0's ndcg@1: 0.9297\tvalid_0's ndcg@2: 0.971389\tvalid_0's ndcg@3: 0.973001\tvalid_0's ndcg@4: 0.973313\tvalid_0's ndcg@5: 0.9734\n",
- "[14]\tvalid_0's ndcg@1: 0.92955\tvalid_0's ndcg@2: 0.971444\tvalid_0's ndcg@3: 0.972994\tvalid_0's ndcg@4: 0.973284\tvalid_0's ndcg@5: 0.973371\n",
- "[15]\tvalid_0's ndcg@1: 0.930225\tvalid_0's ndcg@2: 0.97174\tvalid_0's ndcg@3: 0.973253\tvalid_0's ndcg@4: 0.973543\tvalid_0's ndcg@5: 0.97363\n",
- "[16]\tvalid_0's ndcg@1: 0.930425\tvalid_0's ndcg@2: 0.971798\tvalid_0's ndcg@3: 0.973298\tvalid_0's ndcg@4: 0.97361\tvalid_0's ndcg@5: 0.973698\n",
- "[17]\tvalid_0's ndcg@1: 0.93125\tvalid_0's ndcg@2: 0.971992\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973903\tvalid_0's ndcg@5: 0.97398\n",
- "[18]\tvalid_0's ndcg@1: 0.931925\tvalid_0's ndcg@2: 0.972257\tvalid_0's ndcg@3: 0.973845\tvalid_0's ndcg@4: 0.974146\tvalid_0's ndcg@5: 0.974224\n",
- "[19]\tvalid_0's ndcg@1: 0.932375\tvalid_0's ndcg@2: 0.972376\tvalid_0's ndcg@3: 0.974038\tvalid_0's ndcg@4: 0.974318\tvalid_0's ndcg@5: 0.974376\n",
- "[20]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.972269\tvalid_0's ndcg@3: 0.973907\tvalid_0's ndcg@4: 0.974187\tvalid_0's ndcg@5: 0.974245\n",
- "[21]\tvalid_0's ndcg@1: 0.932725\tvalid_0's ndcg@2: 0.972568\tvalid_0's ndcg@3: 0.974181\tvalid_0's ndcg@4: 0.974471\tvalid_0's ndcg@5: 0.974529\n",
- "[22]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972735\tvalid_0's ndcg@3: 0.974298\tvalid_0's ndcg@4: 0.974599\tvalid_0's ndcg@5: 0.974657\n",
- "[23]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972642\tvalid_0's ndcg@3: 0.974255\tvalid_0's ndcg@4: 0.974545\tvalid_0's ndcg@5: 0.974594\n",
- "[24]\tvalid_0's ndcg@1: 0.933175\tvalid_0's ndcg@2: 0.972734\tvalid_0's ndcg@3: 0.974347\tvalid_0's ndcg@4: 0.974638\tvalid_0's ndcg@5: 0.974686\n",
- "[25]\tvalid_0's ndcg@1: 0.9331\tvalid_0's ndcg@2: 0.972754\tvalid_0's ndcg@3: 0.974366\tvalid_0's ndcg@4: 0.974636\tvalid_0's ndcg@5: 0.974674\n"
- ]
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:41.843180Z",
+ "start_time": "2020-11-18T04:20:41.837287Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data_path = './data_raw/'\n",
+ "save_path = './temp_results/'\n",
+ "offline = False"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[26]\tvalid_0's ndcg@1: 0.933275\tvalid_0's ndcg@2: 0.972787\tvalid_0's ndcg@3: 0.974424\tvalid_0's ndcg@4: 0.974694\tvalid_0's ndcg@5: 0.974732\n",
- "[27]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972809\tvalid_0's ndcg@3: 0.974434\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.974732\n",
- "[28]\tvalid_0's ndcg@1: 0.933625\tvalid_0's ndcg@2: 0.972932\tvalid_0's ndcg@3: 0.974557\tvalid_0's ndcg@4: 0.974826\tvalid_0's ndcg@5: 0.974855\n",
- "[29]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972937\tvalid_0's ndcg@3: 0.974587\tvalid_0's ndcg@4: 0.974856\tvalid_0's ndcg@5: 0.974885\n",
- "[30]\tvalid_0's ndcg@1: 0.93355\tvalid_0's ndcg@2: 0.972873\tvalid_0's ndcg@3: 0.974523\tvalid_0's ndcg@4: 0.974792\tvalid_0's ndcg@5: 0.974821\n",
- "[31]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973065\tvalid_0's ndcg@3: 0.974753\tvalid_0's ndcg@4: 0.975022\tvalid_0's ndcg@5: 0.975051\n",
- "[32]\tvalid_0's ndcg@1: 0.93435\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974815\tvalid_0's ndcg@4: 0.975084\tvalid_0's ndcg@5: 0.975113\n",
- "[33]\tvalid_0's ndcg@1: 0.934475\tvalid_0's ndcg@2: 0.97323\tvalid_0's ndcg@3: 0.974855\tvalid_0's ndcg@4: 0.975135\tvalid_0's ndcg@5: 0.975164\n",
- "[34]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973113\tvalid_0's ndcg@3: 0.974738\tvalid_0's ndcg@4: 0.975028\tvalid_0's ndcg@5: 0.975057\n",
- "[35]\tvalid_0's ndcg@1: 0.93455\tvalid_0's ndcg@2: 0.973258\tvalid_0's ndcg@3: 0.97487\tvalid_0's ndcg@4: 0.975172\tvalid_0's ndcg@5: 0.975201\n",
- "[36]\tvalid_0's ndcg@1: 0.9344\tvalid_0's ndcg@2: 0.973265\tvalid_0's ndcg@3: 0.974828\tvalid_0's ndcg@4: 0.975129\tvalid_0's ndcg@5: 0.975158\n",
- "[37]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973438\tvalid_0's ndcg@3: 0.975013\tvalid_0's ndcg@4: 0.975304\tvalid_0's ndcg@5: 0.975323\n",
- "[38]\tvalid_0's ndcg@1: 0.934975\tvalid_0's ndcg@2: 0.973541\tvalid_0's ndcg@3: 0.975066\tvalid_0's ndcg@4: 0.975367\tvalid_0's ndcg@5: 0.975386\n",
- "[39]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973667\tvalid_0's ndcg@3: 0.975192\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975502\n",
- "[40]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973624\tvalid_0's ndcg@3: 0.975174\tvalid_0's ndcg@4: 0.975454\tvalid_0's ndcg@5: 0.975473\n",
- "[41]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973686\tvalid_0's ndcg@3: 0.975223\tvalid_0's ndcg@4: 0.975503\tvalid_0's ndcg@5: 0.975522\n",
- "[42]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973716\tvalid_0's ndcg@3: 0.975266\tvalid_0's ndcg@4: 0.975546\tvalid_0's ndcg@5: 0.975565\n",
- "[43]\tvalid_0's ndcg@1: 0.93615\tvalid_0's ndcg@2: 0.974022\tvalid_0's ndcg@3: 0.975534\tvalid_0's ndcg@4: 0.975814\tvalid_0's ndcg@5: 0.975843\n",
- "[44]\tvalid_0's ndcg@1: 0.936225\tvalid_0's ndcg@2: 0.974112\tvalid_0's ndcg@3: 0.975562\tvalid_0's ndcg@4: 0.975853\tvalid_0's ndcg@5: 0.975882\n",
- "[45]\tvalid_0's ndcg@1: 0.9365\tvalid_0's ndcg@2: 0.974167\tvalid_0's ndcg@3: 0.975654\tvalid_0's ndcg@4: 0.975945\tvalid_0's ndcg@5: 0.975974\n",
- "[46]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974206\tvalid_0's ndcg@3: 0.975694\tvalid_0's ndcg@4: 0.975995\tvalid_0's ndcg@5: 0.976024\n",
- "[47]\tvalid_0's ndcg@1: 0.93685\tvalid_0's ndcg@2: 0.974311\tvalid_0's ndcg@3: 0.975786\tvalid_0's ndcg@4: 0.976077\tvalid_0's ndcg@5: 0.976106\n",
- "[48]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974408\tvalid_0's ndcg@3: 0.975845\tvalid_0's ndcg@4: 0.976147\tvalid_0's ndcg@5: 0.976185\n",
- "[49]\tvalid_0's ndcg@1: 0.936975\tvalid_0's ndcg@2: 0.974342\tvalid_0's ndcg@3: 0.975829\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.976159\n",
- "[50]\tvalid_0's ndcg@1: 0.9371\tvalid_0's ndcg@2: 0.974388\tvalid_0's ndcg@3: 0.97585\tvalid_0's ndcg@4: 0.976152\tvalid_0's ndcg@5: 0.976191\n",
- "[51]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974329\tvalid_0's ndcg@3: 0.975841\tvalid_0's ndcg@4: 0.976121\tvalid_0's ndcg@5: 0.97616\n",
- "[52]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974578\tvalid_0's ndcg@3: 0.976078\tvalid_0's ndcg@4: 0.976369\tvalid_0's ndcg@5: 0.976407\n",
- "[53]\tvalid_0's ndcg@1: 0.9378\tvalid_0's ndcg@2: 0.974615\tvalid_0's ndcg@3: 0.976115\tvalid_0's ndcg@4: 0.976405\tvalid_0's ndcg@5: 0.976444\n",
- "[54]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974689\tvalid_0's ndcg@3: 0.976214\tvalid_0's ndcg@4: 0.976483\tvalid_0's ndcg@5: 0.976521\n",
- "[55]\tvalid_0's ndcg@1: 0.938225\tvalid_0's ndcg@2: 0.974803\tvalid_0's ndcg@3: 0.976303\tvalid_0's ndcg@4: 0.976572\tvalid_0's ndcg@5: 0.976611\n",
- "[56]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.9748\tvalid_0's ndcg@3: 0.976275\tvalid_0's ndcg@4: 0.976555\tvalid_0's ndcg@5: 0.976594\n",
- "[57]\tvalid_0's ndcg@1: 0.938525\tvalid_0's ndcg@2: 0.974914\tvalid_0's ndcg@3: 0.976414\tvalid_0's ndcg@4: 0.976683\tvalid_0's ndcg@5: 0.976722\n",
- "[58]\tvalid_0's ndcg@1: 0.93875\tvalid_0's ndcg@2: 0.975028\tvalid_0's ndcg@3: 0.976503\tvalid_0's ndcg@4: 0.976773\tvalid_0's ndcg@5: 0.976811\n",
- "[59]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975198\tvalid_0's ndcg@3: 0.976648\tvalid_0's ndcg@4: 0.976918\tvalid_0's ndcg@5: 0.976956\n",
- "[60]\tvalid_0's ndcg@1: 0.939025\tvalid_0's ndcg@2: 0.975177\tvalid_0's ndcg@3: 0.976615\tvalid_0's ndcg@4: 0.976884\tvalid_0's ndcg@5: 0.976923\n",
- "[61]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975205\tvalid_0's ndcg@3: 0.976642\tvalid_0's ndcg@4: 0.976912\tvalid_0's ndcg@5: 0.97695\n",
- "[62]\tvalid_0's ndcg@1: 0.93965\tvalid_0's ndcg@2: 0.975424\tvalid_0's ndcg@3: 0.976836\tvalid_0's ndcg@4: 0.977116\tvalid_0's ndcg@5: 0.977155\n",
- "[63]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.976996\tvalid_0's ndcg@4: 0.977276\tvalid_0's ndcg@5: 0.977315\n",
- "[64]\tvalid_0's ndcg@1: 0.940375\tvalid_0's ndcg@2: 0.975723\tvalid_0's ndcg@3: 0.977123\tvalid_0's ndcg@4: 0.977392\tvalid_0's ndcg@5: 0.977431\n",
- "[65]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977154\tvalid_0's ndcg@4: 0.977423\tvalid_0's ndcg@5: 0.977462\n",
- "[66]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975744\tvalid_0's ndcg@3: 0.977156\tvalid_0's ndcg@4: 0.977426\tvalid_0's ndcg@5: 0.977464\n",
- "[67]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.97576\tvalid_0's ndcg@3: 0.977172\tvalid_0's ndcg@4: 0.977431\tvalid_0's ndcg@5: 0.977469\n",
- "[68]\tvalid_0's ndcg@1: 0.940675\tvalid_0's ndcg@2: 0.975849\tvalid_0's ndcg@3: 0.977249\tvalid_0's ndcg@4: 0.977508\tvalid_0's ndcg@5: 0.977546\n",
- "[69]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.976017\tvalid_0's ndcg@3: 0.977454\tvalid_0's ndcg@4: 0.977724\tvalid_0's ndcg@5: 0.977762\n",
- "[70]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.977362\tvalid_0's ndcg@4: 0.977631\tvalid_0's ndcg@5: 0.97767\n",
- "[71]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.97735\tvalid_0's ndcg@4: 0.97763\tvalid_0's ndcg@5: 0.977668\n",
- "[72]\tvalid_0's ndcg@1: 0.941325\tvalid_0's ndcg@2: 0.976058\tvalid_0's ndcg@3: 0.97747\tvalid_0's ndcg@4: 0.977739\tvalid_0's ndcg@5: 0.977778\n",
- "[73]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977476\tvalid_0's ndcg@4: 0.977756\tvalid_0's ndcg@5: 0.977795\n",
- "[74]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.97759\tvalid_0's ndcg@4: 0.97788\tvalid_0's ndcg@5: 0.977919\n",
- "[75]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.977602\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.977921\n",
- "[76]\tvalid_0's ndcg@1: 0.94195\tvalid_0's ndcg@2: 0.976273\tvalid_0's ndcg@3: 0.977685\tvalid_0's ndcg@4: 0.977965\tvalid_0's ndcg@5: 0.978004\n",
- "[77]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.97627\tvalid_0's ndcg@3: 0.97767\tvalid_0's ndcg@4: 0.97795\tvalid_0's ndcg@5: 0.977989\n",
- "[78]\tvalid_0's ndcg@1: 0.94235\tvalid_0's ndcg@2: 0.976452\tvalid_0's ndcg@3: 0.977839\tvalid_0's ndcg@4: 0.978119\tvalid_0's ndcg@5: 0.978158\n",
- "[79]\tvalid_0's ndcg@1: 0.94265\tvalid_0's ndcg@2: 0.976562\tvalid_0's ndcg@3: 0.977937\tvalid_0's ndcg@4: 0.978228\tvalid_0's ndcg@5: 0.978267\n",
- "[80]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976667\tvalid_0's ndcg@3: 0.978067\tvalid_0's ndcg@4: 0.978347\tvalid_0's ndcg@5: 0.978385\n",
- "[81]\tvalid_0's ndcg@1: 0.94305\tvalid_0's ndcg@2: 0.97671\tvalid_0's ndcg@3: 0.978098\tvalid_0's ndcg@4: 0.978378\tvalid_0's ndcg@5: 0.978416\n",
- "[82]\tvalid_0's ndcg@1: 0.943175\tvalid_0's ndcg@2: 0.97674\tvalid_0's ndcg@3: 0.978115\tvalid_0's ndcg@4: 0.978417\tvalid_0's ndcg@5: 0.978456\n",
- "[83]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976752\tvalid_0's ndcg@3: 0.97814\tvalid_0's ndcg@4: 0.978441\tvalid_0's ndcg@5: 0.97848\n",
- "[84]\tvalid_0's ndcg@1: 0.943375\tvalid_0's ndcg@2: 0.976767\tvalid_0's ndcg@3: 0.978179\tvalid_0's ndcg@4: 0.978481\tvalid_0's ndcg@5: 0.97852\n",
- "[85]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976721\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978437\tvalid_0's ndcg@5: 0.978475\n",
- "[86]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976792\tvalid_0's ndcg@3: 0.978204\tvalid_0's ndcg@4: 0.978506\tvalid_0's ndcg@5: 0.978535\n",
- "[87]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.976851\tvalid_0's ndcg@3: 0.978239\tvalid_0's ndcg@4: 0.97854\tvalid_0's ndcg@5: 0.978569\n",
- "[88]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976882\tvalid_0's ndcg@3: 0.978282\tvalid_0's ndcg@4: 0.978572\tvalid_0's ndcg@5: 0.978611\n",
- "[89]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.976915\tvalid_0's ndcg@3: 0.97834\tvalid_0's ndcg@4: 0.97863\tvalid_0's ndcg@5: 0.978669\n"
- ]
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:20:53.358138Z",
+ "start_time": "2020-11-18T04:20:44.232944Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# When the data is reloaded, click_article_id is read back as a float, so cast it to int\n",
+ "trn_user_item_feats_df = pd.read_csv(save_path + 'trn_user_item_feats_df.csv')\n",
+ "trn_user_item_feats_df['click_article_id'] = trn_user_item_feats_df['click_article_id'].astype(int)\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df = pd.read_csv(save_path + 'val_user_item_feats_df.csv')\n",
+ " val_user_item_feats_df['click_article_id'] = val_user_item_feats_df['click_article_id'].astype(int)\n",
+ "else:\n",
+ " val_user_item_feats_df = None\n",
+ " \n",
+ "tst_user_item_feats_df = pd.read_csv(save_path + 'tst_user_item_feats_df.csv')\n",
+ "tst_user_item_feats_df['click_article_id'] = tst_user_item_feats_df['click_article_id'].astype(int)\n",
+ "\n",
+ "# For convenience during feature engineering, the test set was also given a placeholder (invalid) label; simply drop it here\n",
+ "del tst_user_item_feats_df['label']"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[90]\tvalid_0's ndcg@1: 0.943925\tvalid_0's ndcg@2: 0.976986\tvalid_0's ndcg@3: 0.978398\tvalid_0's ndcg@4: 0.978689\tvalid_0's ndcg@5: 0.978728\n",
- "[91]\tvalid_0's ndcg@1: 0.943875\tvalid_0's ndcg@2: 0.976999\tvalid_0's ndcg@3: 0.978399\tvalid_0's ndcg@4: 0.978679\tvalid_0's ndcg@5: 0.978717\n",
- "[92]\tvalid_0's ndcg@1: 0.94395\tvalid_0's ndcg@2: 0.977058\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978711\tvalid_0's ndcg@5: 0.97876\n",
- "[93]\tvalid_0's ndcg@1: 0.944075\tvalid_0's ndcg@2: 0.977104\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978759\tvalid_0's ndcg@5: 0.978807\n",
- "[94]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977125\tvalid_0's ndcg@3: 0.978513\tvalid_0's ndcg@4: 0.978793\tvalid_0's ndcg@5: 0.978841\n",
- "[95]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977153\tvalid_0's ndcg@3: 0.97854\tvalid_0's ndcg@4: 0.97882\tvalid_0's ndcg@5: 0.978869\n",
- "[96]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977144\tvalid_0's ndcg@3: 0.978531\tvalid_0's ndcg@4: 0.978811\tvalid_0's ndcg@5: 0.97886\n",
- "[97]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977221\tvalid_0's ndcg@3: 0.978584\tvalid_0's ndcg@4: 0.978864\tvalid_0's ndcg@5: 0.978912\n",
- "[98]\tvalid_0's ndcg@1: 0.944575\tvalid_0's ndcg@2: 0.977289\tvalid_0's ndcg@3: 0.978651\tvalid_0's ndcg@4: 0.978942\tvalid_0's ndcg@5: 0.97899\n",
- "[99]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977341\tvalid_0's ndcg@3: 0.978691\tvalid_0's ndcg@4: 0.978993\tvalid_0's ndcg@5: 0.979032\n",
- "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
- "[1]\tvalid_0's ndcg@1: 0.911575\tvalid_0's ndcg@2: 0.964384\tvalid_0's ndcg@3: 0.966321\tvalid_0's ndcg@4: 0.966623\tvalid_0's ndcg@5: 0.966671\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9136\tvalid_0's ndcg@2: 0.965257\tvalid_0's ndcg@3: 0.967107\tvalid_0's ndcg@4: 0.967398\tvalid_0's ndcg@5: 0.967456\n",
- "[3]\tvalid_0's ndcg@1: 0.917425\tvalid_0's ndcg@2: 0.966732\tvalid_0's ndcg@3: 0.968545\tvalid_0's ndcg@4: 0.968814\tvalid_0's ndcg@5: 0.968882\n",
- "[4]\tvalid_0's ndcg@1: 0.9222\tvalid_0's ndcg@2: 0.968558\tvalid_0's ndcg@3: 0.970383\tvalid_0's ndcg@4: 0.970619\tvalid_0's ndcg@5: 0.970668\n",
- "[5]\tvalid_0's ndcg@1: 0.925875\tvalid_0's ndcg@2: 0.969914\tvalid_0's ndcg@3: 0.971714\tvalid_0's ndcg@4: 0.971972\tvalid_0's ndcg@5: 0.972021\n",
- "[6]\tvalid_0's ndcg@1: 0.926875\tvalid_0's ndcg@2: 0.970425\tvalid_0's ndcg@3: 0.972112\tvalid_0's ndcg@4: 0.972371\tvalid_0's ndcg@5: 0.972419\n",
- "[7]\tvalid_0's ndcg@1: 0.927475\tvalid_0's ndcg@2: 0.970631\tvalid_0's ndcg@3: 0.972306\tvalid_0's ndcg@4: 0.972586\tvalid_0's ndcg@5: 0.972634\n",
- "[8]\tvalid_0's ndcg@1: 0.93015\tvalid_0's ndcg@2: 0.971649\tvalid_0's ndcg@3: 0.973287\tvalid_0's ndcg@4: 0.973567\tvalid_0's ndcg@5: 0.973625\n",
- "[9]\tvalid_0's ndcg@1: 0.9312\tvalid_0's ndcg@2: 0.972084\tvalid_0's ndcg@3: 0.973684\tvalid_0's ndcg@4: 0.973964\tvalid_0's ndcg@5: 0.974022\n",
- "[10]\tvalid_0's ndcg@1: 0.93225\tvalid_0's ndcg@2: 0.972456\tvalid_0's ndcg@3: 0.974081\tvalid_0's ndcg@4: 0.974361\tvalid_0's ndcg@5: 0.974409\n",
- "[11]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972704\tvalid_0's ndcg@3: 0.974379\tvalid_0's ndcg@4: 0.974648\tvalid_0's ndcg@5: 0.974696\n",
- "[12]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974574\tvalid_0's ndcg@4: 0.974832\tvalid_0's ndcg@5: 0.974881\n",
- "[13]\tvalid_0's ndcg@1: 0.93415\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.97482\tvalid_0's ndcg@4: 0.975079\tvalid_0's ndcg@5: 0.975127\n",
- "[14]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973671\tvalid_0's ndcg@3: 0.975246\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975531\n",
- "[15]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.973877\tvalid_0's ndcg@3: 0.975452\tvalid_0's ndcg@4: 0.975699\tvalid_0's ndcg@5: 0.975748\n",
- "[16]\tvalid_0's ndcg@1: 0.935825\tvalid_0's ndcg@2: 0.973917\tvalid_0's ndcg@3: 0.975442\tvalid_0's ndcg@4: 0.975712\tvalid_0's ndcg@5: 0.97576\n",
- "[17]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.97411\tvalid_0's ndcg@3: 0.975697\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975995\n",
- "[18]\tvalid_0's ndcg@1: 0.936925\tvalid_0's ndcg@2: 0.974292\tvalid_0's ndcg@3: 0.975867\tvalid_0's ndcg@4: 0.976114\tvalid_0's ndcg@5: 0.976163\n",
- "[19]\tvalid_0's ndcg@1: 0.937525\tvalid_0's ndcg@2: 0.974545\tvalid_0's ndcg@3: 0.976095\tvalid_0's ndcg@4: 0.976342\tvalid_0's ndcg@5: 0.976391\n",
- "[20]\tvalid_0's ndcg@1: 0.937775\tvalid_0's ndcg@2: 0.974653\tvalid_0's ndcg@3: 0.976203\tvalid_0's ndcg@4: 0.976429\tvalid_0's ndcg@5: 0.976487\n",
- "[21]\tvalid_0's ndcg@1: 0.938825\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976597\tvalid_0's ndcg@4: 0.976823\tvalid_0's ndcg@5: 0.976881\n",
- "[22]\tvalid_0's ndcg@1: 0.93885\tvalid_0's ndcg@2: 0.975097\tvalid_0's ndcg@3: 0.976609\tvalid_0's ndcg@4: 0.976846\tvalid_0's ndcg@5: 0.976895\n",
- "[23]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976733\tvalid_0's ndcg@4: 0.976959\tvalid_0's ndcg@5: 0.977008\n",
- "[24]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976721\tvalid_0's ndcg@4: 0.976947\tvalid_0's ndcg@5: 0.977005\n",
- "[25]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975421\tvalid_0's ndcg@3: 0.976909\tvalid_0's ndcg@4: 0.977124\tvalid_0's ndcg@5: 0.977182\n",
- "[26]\tvalid_0's ndcg@1: 0.9393\tvalid_0's ndcg@2: 0.975342\tvalid_0's ndcg@3: 0.976804\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977078\n",
- "[27]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975323\tvalid_0's ndcg@3: 0.976798\tvalid_0's ndcg@4: 0.977014\tvalid_0's ndcg@5: 0.977062\n",
- "[28]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975308\tvalid_0's ndcg@3: 0.976783\tvalid_0's ndcg@4: 0.977009\tvalid_0's ndcg@5: 0.977057\n",
- "[29]\tvalid_0's ndcg@1: 0.94\tvalid_0's ndcg@2: 0.975569\tvalid_0's ndcg@3: 0.977056\tvalid_0's ndcg@4: 0.977282\tvalid_0's ndcg@5: 0.977331\n",
- "[30]\tvalid_0's ndcg@1: 0.940325\tvalid_0's ndcg@2: 0.975673\tvalid_0's ndcg@3: 0.977173\tvalid_0's ndcg@4: 0.977399\tvalid_0's ndcg@5: 0.977447\n",
- "[31]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975731\tvalid_0's ndcg@3: 0.977243\tvalid_0's ndcg@4: 0.977469\tvalid_0's ndcg@5: 0.977518\n",
- "[32]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
- "[33]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977241\tvalid_0's ndcg@4: 0.977457\tvalid_0's ndcg@5: 0.977505\n",
- "[34]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
- "[35]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975868\tvalid_0's ndcg@3: 0.977343\tvalid_0's ndcg@4: 0.977558\tvalid_0's ndcg@5: 0.977606\n",
- "[36]\tvalid_0's ndcg@1: 0.94115\tvalid_0's ndcg@2: 0.976056\tvalid_0's ndcg@3: 0.977506\tvalid_0's ndcg@4: 0.977722\tvalid_0's ndcg@5: 0.97777\n",
- "[37]\tvalid_0's ndcg@1: 0.9414\tvalid_0's ndcg@2: 0.976133\tvalid_0's ndcg@3: 0.977595\tvalid_0's ndcg@4: 0.977811\tvalid_0's ndcg@5: 0.977859\n",
- "[38]\tvalid_0's ndcg@1: 0.94175\tvalid_0's ndcg@2: 0.976278\tvalid_0's ndcg@3: 0.977715\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.97799\n",
- "[39]\tvalid_0's ndcg@1: 0.942075\tvalid_0's ndcg@2: 0.976366\tvalid_0's ndcg@3: 0.977841\tvalid_0's ndcg@4: 0.978056\tvalid_0's ndcg@5: 0.978105\n",
- "[40]\tvalid_0's ndcg@1: 0.94215\tvalid_0's ndcg@2: 0.976409\tvalid_0's ndcg@3: 0.977872\tvalid_0's ndcg@4: 0.978087\tvalid_0's ndcg@5: 0.978136\n",
- "[41]\tvalid_0's ndcg@1: 0.94245\tvalid_0's ndcg@2: 0.97652\tvalid_0's ndcg@3: 0.977983\tvalid_0's ndcg@4: 0.978198\tvalid_0's ndcg@5: 0.978246\n",
- "[42]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
- "[43]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
- "[44]\tvalid_0's ndcg@1: 0.94285\tvalid_0's ndcg@2: 0.976636\tvalid_0's ndcg@3: 0.978111\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978386\n",
- "[45]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.9768\tvalid_0's ndcg@3: 0.978262\tvalid_0's ndcg@4: 0.978488\tvalid_0's ndcg@5: 0.978537\n",
- "[46]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
- "[47]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97836\tvalid_0's ndcg@4: 0.978576\tvalid_0's ndcg@5: 0.978634\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Return the ranked results"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[48]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.978373\tvalid_0's ndcg@4: 0.978577\tvalid_0's ndcg@5: 0.978636\n",
- "[49]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
- "[50]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97702\tvalid_0's ndcg@3: 0.97852\tvalid_0's ndcg@4: 0.978746\tvalid_0's ndcg@5: 0.978794\n",
- "[51]\tvalid_0's ndcg@1: 0.9441\tvalid_0's ndcg@2: 0.97705\tvalid_0's ndcg@3: 0.97855\tvalid_0's ndcg@4: 0.978787\tvalid_0's ndcg@5: 0.978836\n",
- "[52]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977121\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978846\tvalid_0's ndcg@5: 0.978894\n",
- "[53]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977081\tvalid_0's ndcg@3: 0.978618\tvalid_0's ndcg@4: 0.978834\tvalid_0's ndcg@5: 0.978882\n",
- "[54]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977071\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978824\tvalid_0's ndcg@5: 0.978873\n",
- "[55]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977143\tvalid_0's ndcg@3: 0.978668\tvalid_0's ndcg@4: 0.978883\tvalid_0's ndcg@5: 0.978931\n",
- "[56]\tvalid_0's ndcg@1: 0.9444\tvalid_0's ndcg@2: 0.977177\tvalid_0's ndcg@3: 0.978702\tvalid_0's ndcg@4: 0.978906\tvalid_0's ndcg@5: 0.978955\n",
- "[57]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977263\tvalid_0's ndcg@3: 0.978788\tvalid_0's ndcg@4: 0.979003\tvalid_0's ndcg@5: 0.979051\n",
- "[58]\tvalid_0's ndcg@1: 0.9448\tvalid_0's ndcg@2: 0.977293\tvalid_0's ndcg@3: 0.978843\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979096\n",
- "[59]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977472\tvalid_0's ndcg@3: 0.978997\tvalid_0's ndcg@4: 0.979202\tvalid_0's ndcg@5: 0.97925\n",
- "[60]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.97763\tvalid_0's ndcg@3: 0.979118\tvalid_0's ndcg@4: 0.979322\tvalid_0's ndcg@5: 0.979371\n",
- "[61]\tvalid_0's ndcg@1: 0.945725\tvalid_0's ndcg@2: 0.977682\tvalid_0's ndcg@3: 0.979194\tvalid_0's ndcg@4: 0.979399\tvalid_0's ndcg@5: 0.979447\n",
- "[62]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977812\tvalid_0's ndcg@3: 0.979312\tvalid_0's ndcg@4: 0.979495\tvalid_0's ndcg@5: 0.979543\n",
- "[63]\tvalid_0's ndcg@1: 0.946\tvalid_0's ndcg@2: 0.977878\tvalid_0's ndcg@3: 0.97934\tvalid_0's ndcg@4: 0.979523\tvalid_0's ndcg@5: 0.979572\n",
- "[64]\tvalid_0's ndcg@1: 0.946525\tvalid_0's ndcg@2: 0.978056\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979714\tvalid_0's ndcg@5: 0.979762\n",
- "[65]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.978105\tvalid_0's ndcg@3: 0.979592\tvalid_0's ndcg@4: 0.979775\tvalid_0's ndcg@5: 0.979823\n",
- "[66]\tvalid_0's ndcg@1: 0.9465\tvalid_0's ndcg@2: 0.978046\tvalid_0's ndcg@3: 0.979534\tvalid_0's ndcg@4: 0.979706\tvalid_0's ndcg@5: 0.979755\n",
- "[67]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.978127\tvalid_0's ndcg@3: 0.979614\tvalid_0's ndcg@4: 0.979776\tvalid_0's ndcg@5: 0.979824\n",
- "[68]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.979608\tvalid_0's ndcg@4: 0.97978\tvalid_0's ndcg@5: 0.979828\n",
- "[69]\tvalid_0's ndcg@1: 0.946875\tvalid_0's ndcg@2: 0.978216\tvalid_0's ndcg@3: 0.979679\tvalid_0's ndcg@4: 0.979851\tvalid_0's ndcg@5: 0.9799\n",
- "[70]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.978194\tvalid_0's ndcg@3: 0.979682\tvalid_0's ndcg@4: 0.979854\tvalid_0's ndcg@5: 0.979902\n",
- "[71]\tvalid_0's ndcg@1: 0.947025\tvalid_0's ndcg@2: 0.978209\tvalid_0's ndcg@3: 0.979721\tvalid_0's ndcg@4: 0.979893\tvalid_0's ndcg@5: 0.979942\n",
- "[72]\tvalid_0's ndcg@1: 0.9472\tvalid_0's ndcg@2: 0.978273\tvalid_0's ndcg@3: 0.979773\tvalid_0's ndcg@4: 0.979956\tvalid_0's ndcg@5: 0.980005\n",
- "[73]\tvalid_0's ndcg@1: 0.947475\tvalid_0's ndcg@2: 0.978391\tvalid_0's ndcg@3: 0.979878\tvalid_0's ndcg@4: 0.980061\tvalid_0's ndcg@5: 0.980109\n",
- "[74]\tvalid_0's ndcg@1: 0.94715\tvalid_0's ndcg@2: 0.978271\tvalid_0's ndcg@3: 0.979758\tvalid_0's ndcg@4: 0.979941\tvalid_0's ndcg@5: 0.97999\n",
- "[75]\tvalid_0's ndcg@1: 0.947275\tvalid_0's ndcg@2: 0.978333\tvalid_0's ndcg@3: 0.979808\tvalid_0's ndcg@4: 0.979991\tvalid_0's ndcg@5: 0.980039\n",
- "[76]\tvalid_0's ndcg@1: 0.9474\tvalid_0's ndcg@2: 0.97841\tvalid_0's ndcg@3: 0.979873\tvalid_0's ndcg@4: 0.980045\tvalid_0's ndcg@5: 0.980093\n",
- "[77]\tvalid_0's ndcg@1: 0.94745\tvalid_0's ndcg@2: 0.97846\tvalid_0's ndcg@3: 0.979898\tvalid_0's ndcg@4: 0.98007\tvalid_0's ndcg@5: 0.980118\n",
- "[78]\tvalid_0's ndcg@1: 0.94775\tvalid_0's ndcg@2: 0.978555\tvalid_0's ndcg@3: 0.980005\tvalid_0's ndcg@4: 0.980177\tvalid_0's ndcg@5: 0.980226\n",
- "[79]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
- "[80]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
- "[81]\tvalid_0's ndcg@1: 0.948175\tvalid_0's ndcg@2: 0.978744\tvalid_0's ndcg@3: 0.980169\tvalid_0's ndcg@4: 0.980352\tvalid_0's ndcg@5: 0.98039\n",
- "[82]\tvalid_0's ndcg@1: 0.948375\tvalid_0's ndcg@2: 0.97888\tvalid_0's ndcg@3: 0.980255\tvalid_0's ndcg@4: 0.980438\tvalid_0's ndcg@5: 0.980477\n",
- "[83]\tvalid_0's ndcg@1: 0.94825\tvalid_0's ndcg@2: 0.978834\tvalid_0's ndcg@3: 0.980209\tvalid_0's ndcg@4: 0.980392\tvalid_0's ndcg@5: 0.980431\n",
- "[84]\tvalid_0's ndcg@1: 0.948275\tvalid_0's ndcg@2: 0.978844\tvalid_0's ndcg@3: 0.980219\tvalid_0's ndcg@4: 0.980402\tvalid_0's ndcg@5: 0.98044\n",
- "[85]\tvalid_0's ndcg@1: 0.948475\tvalid_0's ndcg@2: 0.978917\tvalid_0's ndcg@3: 0.980292\tvalid_0's ndcg@4: 0.980475\tvalid_0's ndcg@5: 0.980514\n",
- "[86]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979102\tvalid_0's ndcg@3: 0.980477\tvalid_0's ndcg@4: 0.98066\tvalid_0's ndcg@5: 0.980699\n",
- "[87]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979086\tvalid_0's ndcg@3: 0.980474\tvalid_0's ndcg@4: 0.980657\tvalid_0's ndcg@5: 0.980695\n",
- "[88]\tvalid_0's ndcg@1: 0.949025\tvalid_0's ndcg@2: 0.979136\tvalid_0's ndcg@3: 0.980499\tvalid_0's ndcg@4: 0.980682\tvalid_0's ndcg@5: 0.98072\n",
- "[89]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979285\tvalid_0's ndcg@3: 0.98061\tvalid_0's ndcg@4: 0.980793\tvalid_0's ndcg@5: 0.980832\n",
- "[90]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
- "[91]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
- "[92]\tvalid_0's ndcg@1: 0.9494\tvalid_0's ndcg@2: 0.97929\tvalid_0's ndcg@3: 0.98064\tvalid_0's ndcg@4: 0.980823\tvalid_0's ndcg@5: 0.980862\n",
- "[93]\tvalid_0's ndcg@1: 0.949375\tvalid_0's ndcg@2: 0.979297\tvalid_0's ndcg@3: 0.980634\tvalid_0's ndcg@4: 0.980817\tvalid_0's ndcg@5: 0.980856\n",
- "[94]\tvalid_0's ndcg@1: 0.949525\tvalid_0's ndcg@2: 0.979336\tvalid_0's ndcg@3: 0.980686\tvalid_0's ndcg@4: 0.980869\tvalid_0's ndcg@5: 0.980908\n",
- "[95]\tvalid_0's ndcg@1: 0.949825\tvalid_0's ndcg@2: 0.979416\tvalid_0's ndcg@3: 0.980791\tvalid_0's ndcg@4: 0.980974\tvalid_0's ndcg@5: 0.981012\n",
- "[96]\tvalid_0's ndcg@1: 0.94975\tvalid_0's ndcg@2: 0.979404\tvalid_0's ndcg@3: 0.980779\tvalid_0's ndcg@4: 0.980951\tvalid_0's ndcg@5: 0.98099\n",
- "[97]\tvalid_0's ndcg@1: 0.950025\tvalid_0's ndcg@2: 0.979537\tvalid_0's ndcg@3: 0.980874\tvalid_0's ndcg@4: 0.981057\tvalid_0's ndcg@5: 0.981096\n",
- "[98]\tvalid_0's ndcg@1: 0.9501\tvalid_0's ndcg@2: 0.979564\tvalid_0's ndcg@3: 0.980889\tvalid_0's ndcg@4: 0.981083\tvalid_0's ndcg@5: 0.981122\n",
- "[99]\tvalid_0's ndcg@1: 0.950275\tvalid_0's ndcg@2: 0.979629\tvalid_0's ndcg@3: 0.980967\tvalid_0's ndcg@4: 0.98115\tvalid_0's ndcg@5: 0.981188\n",
- "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
- "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.96382\tvalid_0's ndcg@3: 0.965707\tvalid_0's ndcg@4: 0.966009\tvalid_0's ndcg@5: 0.966086\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.91415\tvalid_0's ndcg@2: 0.965492\tvalid_0's ndcg@3: 0.967254\tvalid_0's ndcg@4: 0.967556\tvalid_0's ndcg@5: 0.967604\n",
- "[3]\tvalid_0's ndcg@1: 0.916025\tvalid_0's ndcg@2: 0.966389\tvalid_0's ndcg@3: 0.967976\tvalid_0's ndcg@4: 0.968278\tvalid_0's ndcg@5: 0.968355\n",
- "[4]\tvalid_0's ndcg@1: 0.919\tvalid_0's ndcg@2: 0.967392\tvalid_0's ndcg@3: 0.96903\tvalid_0's ndcg@4: 0.969364\tvalid_0's ndcg@5: 0.969431\n",
- "[5]\tvalid_0's ndcg@1: 0.921125\tvalid_0's ndcg@2: 0.968192\tvalid_0's ndcg@3: 0.969855\tvalid_0's ndcg@4: 0.970156\tvalid_0's ndcg@5: 0.970224\n",
- "[6]\tvalid_0's ndcg@1: 0.921675\tvalid_0's ndcg@2: 0.968411\tvalid_0's ndcg@3: 0.970111\tvalid_0's ndcg@4: 0.97037\tvalid_0's ndcg@5: 0.970437\n",
- "[7]\tvalid_0's ndcg@1: 0.9237\tvalid_0's ndcg@2: 0.969332\tvalid_0's ndcg@3: 0.970882\tvalid_0's ndcg@4: 0.97113\tvalid_0's ndcg@5: 0.971217\n",
- "[8]\tvalid_0's ndcg@1: 0.925775\tvalid_0's ndcg@2: 0.970129\tvalid_0's ndcg@3: 0.971642\tvalid_0's ndcg@4: 0.971922\tvalid_0's ndcg@5: 0.97199\n",
- "[9]\tvalid_0's ndcg@1: 0.926775\tvalid_0's ndcg@2: 0.970435\tvalid_0's ndcg@3: 0.971985\tvalid_0's ndcg@4: 0.972276\tvalid_0's ndcg@5: 0.972334\n"
- ]
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:01.809368Z",
+ "start_time": "2020-11-18T04:21:01.799641Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def submit(recall_df, topk=5, model_name=None):\n",
+ " recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])\n",
+ " recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+ " # Check that every user has at least topk (5 by default) candidate articles\n",
+ " tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())\n",
+ " assert tmp.min() >= topk\n",
+ " \n",
+ " del recall_df['pred_score']\n",
+ " submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()\n",
+ " \n",
+ " submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]\n",
+ " # Name the columns according to the submission format\n",
+ " submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', \n",
+ " 3: 'article_3', 4: 'article_4', 5: 'article_5'})\n",
+ " \n",
+ " save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'\n",
+ " submit.to_csv(save_name, index=False, header=True)"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[10]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970761\tvalid_0's ndcg@3: 0.972311\tvalid_0's ndcg@4: 0.972612\tvalid_0's ndcg@5: 0.97267\n",
- "[11]\tvalid_0's ndcg@1: 0.928975\tvalid_0's ndcg@2: 0.97131\tvalid_0's ndcg@3: 0.972798\tvalid_0's ndcg@4: 0.973089\tvalid_0's ndcg@5: 0.973166\n",
- "[12]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971505\tvalid_0's ndcg@3: 0.972968\tvalid_0's ndcg@4: 0.973259\tvalid_0's ndcg@5: 0.973326\n",
- "[13]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971426\tvalid_0's ndcg@3: 0.972939\tvalid_0's ndcg@4: 0.97324\tvalid_0's ndcg@5: 0.973318\n",
- "[14]\tvalid_0's ndcg@1: 0.929775\tvalid_0's ndcg@2: 0.971621\tvalid_0's ndcg@3: 0.973121\tvalid_0's ndcg@4: 0.973412\tvalid_0's ndcg@5: 0.97348\n",
- "[15]\tvalid_0's ndcg@1: 0.9304\tvalid_0's ndcg@2: 0.971868\tvalid_0's ndcg@3: 0.97338\tvalid_0's ndcg@4: 0.97365\tvalid_0's ndcg@5: 0.973717\n",
- "[16]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.972096\tvalid_0's ndcg@3: 0.973558\tvalid_0's ndcg@4: 0.973849\tvalid_0's ndcg@5: 0.973926\n",
- "[17]\tvalid_0's ndcg@1: 0.93105\tvalid_0's ndcg@2: 0.972108\tvalid_0's ndcg@3: 0.973583\tvalid_0's ndcg@4: 0.973884\tvalid_0's ndcg@5: 0.973952\n",
- "[18]\tvalid_0's ndcg@1: 0.931725\tvalid_0's ndcg@2: 0.972373\tvalid_0's ndcg@3: 0.97386\tvalid_0's ndcg@4: 0.974129\tvalid_0's ndcg@5: 0.974207\n",
- "[19]\tvalid_0's ndcg@1: 0.932175\tvalid_0's ndcg@2: 0.972681\tvalid_0's ndcg@3: 0.974068\tvalid_0's ndcg@4: 0.974348\tvalid_0's ndcg@5: 0.974406\n",
- "[20]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.973019\tvalid_0's ndcg@3: 0.974382\tvalid_0's ndcg@4: 0.974673\tvalid_0's ndcg@5: 0.974731\n",
- "[21]\tvalid_0's ndcg@1: 0.933075\tvalid_0's ndcg@2: 0.97306\tvalid_0's ndcg@3: 0.974423\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.97477\n",
- "[22]\tvalid_0's ndcg@1: 0.93375\tvalid_0's ndcg@2: 0.973262\tvalid_0's ndcg@3: 0.974649\tvalid_0's ndcg@4: 0.974929\tvalid_0's ndcg@5: 0.975007\n",
- "[23]\tvalid_0's ndcg@1: 0.933675\tvalid_0's ndcg@2: 0.973219\tvalid_0's ndcg@3: 0.974606\tvalid_0's ndcg@4: 0.974886\tvalid_0's ndcg@5: 0.974973\n",
- "[24]\tvalid_0's ndcg@1: 0.934\tvalid_0's ndcg@2: 0.97337\tvalid_0's ndcg@3: 0.974745\tvalid_0's ndcg@4: 0.975014\tvalid_0's ndcg@5: 0.975101\n",
- "[25]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973674\tvalid_0's ndcg@3: 0.975062\tvalid_0's ndcg@4: 0.975342\tvalid_0's ndcg@5: 0.97541\n",
- "[26]\tvalid_0's ndcg@1: 0.93495\tvalid_0's ndcg@2: 0.973721\tvalid_0's ndcg@3: 0.975096\tvalid_0's ndcg@4: 0.975365\tvalid_0's ndcg@5: 0.975452\n",
- "[27]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.974082\tvalid_0's ndcg@3: 0.975444\tvalid_0's ndcg@4: 0.975713\tvalid_0's ndcg@5: 0.975781\n",
- "[28]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973875\tvalid_0's ndcg@3: 0.975275\tvalid_0's ndcg@4: 0.975512\tvalid_0's ndcg@5: 0.975599\n",
- "[29]\tvalid_0's ndcg@1: 0.935925\tvalid_0's ndcg@2: 0.974159\tvalid_0's ndcg@3: 0.975522\tvalid_0's ndcg@4: 0.975759\tvalid_0's ndcg@5: 0.975836\n",
- "[30]\tvalid_0's ndcg@1: 0.9362\tvalid_0's ndcg@2: 0.974214\tvalid_0's ndcg@3: 0.975589\tvalid_0's ndcg@4: 0.975847\tvalid_0's ndcg@5: 0.975924\n",
- "[31]\tvalid_0's ndcg@1: 0.93625\tvalid_0's ndcg@2: 0.974216\tvalid_0's ndcg@3: 0.975629\tvalid_0's ndcg@4: 0.975876\tvalid_0's ndcg@5: 0.975944\n",
- "[32]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974427\tvalid_0's ndcg@3: 0.975814\tvalid_0's ndcg@4: 0.97603\tvalid_0's ndcg@5: 0.976107\n",
- "[33]\tvalid_0's ndcg@1: 0.936775\tvalid_0's ndcg@2: 0.974505\tvalid_0's ndcg@3: 0.975855\tvalid_0's ndcg@4: 0.976081\tvalid_0's ndcg@5: 0.976158\n",
- "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974643\tvalid_0's ndcg@3: 0.975993\tvalid_0's ndcg@4: 0.976219\tvalid_0's ndcg@5: 0.976296\n",
- "[35]\tvalid_0's ndcg@1: 0.937675\tvalid_0's ndcg@2: 0.974805\tvalid_0's ndcg@3: 0.97618\tvalid_0's ndcg@4: 0.976406\tvalid_0's ndcg@5: 0.976484\n",
- "[36]\tvalid_0's ndcg@1: 0.9382\tvalid_0's ndcg@2: 0.974983\tvalid_0's ndcg@3: 0.976371\tvalid_0's ndcg@4: 0.976597\tvalid_0's ndcg@5: 0.976674\n",
- "[37]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.974974\tvalid_0's ndcg@3: 0.976349\tvalid_0's ndcg@4: 0.976586\tvalid_0's ndcg@5: 0.976663\n",
- "[38]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.975143\tvalid_0's ndcg@3: 0.976518\tvalid_0's ndcg@4: 0.976776\tvalid_0's ndcg@5: 0.976844\n",
- "[39]\tvalid_0's ndcg@1: 0.938575\tvalid_0's ndcg@2: 0.975106\tvalid_0's ndcg@3: 0.976481\tvalid_0's ndcg@4: 0.976739\tvalid_0's ndcg@5: 0.976807\n",
- "[40]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.97519\tvalid_0's ndcg@3: 0.976528\tvalid_0's ndcg@4: 0.976775\tvalid_0's ndcg@5: 0.976853\n",
- "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975347\tvalid_0's ndcg@3: 0.976697\tvalid_0's ndcg@4: 0.976934\tvalid_0's ndcg@5: 0.977001\n",
- "[42]\tvalid_0's ndcg@1: 0.939825\tvalid_0's ndcg@2: 0.975599\tvalid_0's ndcg@3: 0.976961\tvalid_0's ndcg@4: 0.977198\tvalid_0's ndcg@5: 0.977266\n",
- "[43]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975639\tvalid_0's ndcg@3: 0.976977\tvalid_0's ndcg@4: 0.977214\tvalid_0's ndcg@5: 0.977282\n",
- "[44]\tvalid_0's ndcg@1: 0.9398\tvalid_0's ndcg@2: 0.975605\tvalid_0's ndcg@3: 0.976955\tvalid_0's ndcg@4: 0.977192\tvalid_0's ndcg@5: 0.97726\n",
- "[45]\tvalid_0's ndcg@1: 0.9401\tvalid_0's ndcg@2: 0.9757\tvalid_0's ndcg@3: 0.977075\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
- "[46]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975845\tvalid_0's ndcg@3: 0.977183\tvalid_0's ndcg@4: 0.97742\tvalid_0's ndcg@5: 0.977497\n",
- "[47]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975854\tvalid_0's ndcg@3: 0.977204\tvalid_0's ndcg@4: 0.97743\tvalid_0's ndcg@5: 0.977508\n",
- "[48]\tvalid_0's ndcg@1: 0.940575\tvalid_0's ndcg@2: 0.975923\tvalid_0's ndcg@3: 0.977273\tvalid_0's ndcg@4: 0.977488\tvalid_0's ndcg@5: 0.977556\n",
- "[49]\tvalid_0's ndcg@1: 0.9407\tvalid_0's ndcg@2: 0.975922\tvalid_0's ndcg@3: 0.977297\tvalid_0's ndcg@4: 0.977501\tvalid_0's ndcg@5: 0.977588\n",
- "[50]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977322\tvalid_0's ndcg@4: 0.977505\tvalid_0's ndcg@5: 0.977592\n",
- "[51]\tvalid_0's ndcg@1: 0.9406\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.97725\tvalid_0's ndcg@4: 0.977422\tvalid_0's ndcg@5: 0.977509\n",
- "[52]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975997\tvalid_0's ndcg@3: 0.977422\tvalid_0's ndcg@4: 0.977594\tvalid_0's ndcg@5: 0.977691\n",
- "[53]\tvalid_0's ndcg@1: 0.940925\tvalid_0's ndcg@2: 0.975989\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977538\tvalid_0's ndcg@5: 0.977644\n",
- "[54]\tvalid_0's ndcg@1: 0.94125\tvalid_0's ndcg@2: 0.976062\tvalid_0's ndcg@3: 0.977487\tvalid_0's ndcg@4: 0.977659\tvalid_0's ndcg@5: 0.977756\n",
- "[55]\tvalid_0's ndcg@1: 0.94145\tvalid_0's ndcg@2: 0.976183\tvalid_0's ndcg@3: 0.97757\tvalid_0's ndcg@4: 0.977742\tvalid_0's ndcg@5: 0.977839\n",
- "[56]\tvalid_0's ndcg@1: 0.941475\tvalid_0's ndcg@2: 0.976176\tvalid_0's ndcg@3: 0.977576\tvalid_0's ndcg@4: 0.977748\tvalid_0's ndcg@5: 0.977845\n",
- "[57]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976139\tvalid_0's ndcg@3: 0.977539\tvalid_0's ndcg@4: 0.977712\tvalid_0's ndcg@5: 0.977808\n",
- "[58]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.97625\tvalid_0's ndcg@3: 0.97765\tvalid_0's ndcg@4: 0.977822\tvalid_0's ndcg@5: 0.977919\n",
- "[59]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.976253\tvalid_0's ndcg@3: 0.977653\tvalid_0's ndcg@4: 0.977836\tvalid_0's ndcg@5: 0.977932\n",
- "[60]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977634\tvalid_0's ndcg@4: 0.977817\tvalid_0's ndcg@5: 0.977914\n",
- "[61]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.976333\tvalid_0's ndcg@3: 0.977745\tvalid_0's ndcg@4: 0.977918\tvalid_0's ndcg@5: 0.978005\n",
- "[62]\tvalid_0's ndcg@1: 0.941975\tvalid_0's ndcg@2: 0.976345\tvalid_0's ndcg@3: 0.977757\tvalid_0's ndcg@4: 0.97794\tvalid_0's ndcg@5: 0.978027\n",
- "[63]\tvalid_0's ndcg@1: 0.9423\tvalid_0's ndcg@2: 0.976496\tvalid_0's ndcg@3: 0.977871\tvalid_0's ndcg@4: 0.978065\tvalid_0's ndcg@5: 0.978152\n",
- "[64]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976632\tvalid_0's ndcg@3: 0.977995\tvalid_0's ndcg@4: 0.978188\tvalid_0's ndcg@5: 0.978275\n",
- "[65]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976629\tvalid_0's ndcg@3: 0.977979\tvalid_0's ndcg@4: 0.978173\tvalid_0's ndcg@5: 0.97826\n",
- "[66]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976685\tvalid_0's ndcg@3: 0.978035\tvalid_0's ndcg@4: 0.978229\tvalid_0's ndcg@5: 0.978316\n",
- "[67]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976678\tvalid_0's ndcg@3: 0.978041\tvalid_0's ndcg@4: 0.978224\tvalid_0's ndcg@5: 0.97832\n",
- "[68]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976694\tvalid_0's ndcg@3: 0.978044\tvalid_0's ndcg@4: 0.978227\tvalid_0's ndcg@5: 0.978324\n",
- "[69]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976834\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978329\tvalid_0's ndcg@5: 0.978426\n",
- "[70]\tvalid_0's ndcg@1: 0.943025\tvalid_0's ndcg@2: 0.976827\tvalid_0's ndcg@3: 0.978152\tvalid_0's ndcg@4: 0.978324\tvalid_0's ndcg@5: 0.978431\n",
- "[71]\tvalid_0's ndcg@1: 0.9432\tvalid_0's ndcg@2: 0.976923\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978504\n",
- "[72]\tvalid_0's ndcg@1: 0.943225\tvalid_0's ndcg@2: 0.976917\tvalid_0's ndcg@3: 0.978254\tvalid_0's ndcg@4: 0.978405\tvalid_0's ndcg@5: 0.978511\n",
- "[73]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976936\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978409\tvalid_0's ndcg@5: 0.978496\n"
- ]
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:04.332198Z",
+ "start_time": "2020-11-18T04:21:04.325020Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Normalize the ranking scores\n",
+ "def norm_sim(sim_df, weight=0.0):\n",
+ " # print(sim_df.head())\n",
+ " min_sim = sim_df.min()\n",
+ " max_sim = sim_df.max()\n",
+ " if max_sim == min_sim:\n",
+ " sim_df = sim_df.apply(lambda sim: 1.0)\n",
+ " else:\n",
+ " sim_df = sim_df.apply(lambda sim: 1.0 * (sim - min_sim) / (max_sim - min_sim))\n",
+ "\n",
+    "    sim_df = sim_df.apply(lambda sim: sim + weight)  # shift by the weight offset\n",
+ " return sim_df"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[74]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976957\tvalid_0's ndcg@3: 0.97827\tvalid_0's ndcg@4: 0.978431\tvalid_0's ndcg@5: 0.978528\n",
- "[75]\tvalid_0's ndcg@1: 0.943075\tvalid_0's ndcg@2: 0.976861\tvalid_0's ndcg@3: 0.978199\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978457\n",
- "[76]\tvalid_0's ndcg@1: 0.94335\tvalid_0's ndcg@2: 0.976963\tvalid_0's ndcg@3: 0.978288\tvalid_0's ndcg@4: 0.978471\tvalid_0's ndcg@5: 0.978568\n",
- "[77]\tvalid_0's ndcg@1: 0.94345\tvalid_0's ndcg@2: 0.977031\tvalid_0's ndcg@3: 0.978331\tvalid_0's ndcg@4: 0.978514\tvalid_0's ndcg@5: 0.978611\n",
- "[78]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.977088\tvalid_0's ndcg@3: 0.97835\tvalid_0's ndcg@4: 0.978533\tvalid_0's ndcg@5: 0.97863\n",
- "[79]\tvalid_0's ndcg@1: 0.943625\tvalid_0's ndcg@2: 0.977096\tvalid_0's ndcg@3: 0.978396\tvalid_0's ndcg@4: 0.978579\tvalid_0's ndcg@5: 0.978676\n",
- "[80]\tvalid_0's ndcg@1: 0.943825\tvalid_0's ndcg@2: 0.977154\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978651\tvalid_0's ndcg@5: 0.978748\n",
- "[81]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.97846\tvalid_0's ndcg@4: 0.978633\tvalid_0's ndcg@5: 0.978729\n",
- "[82]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.977361\tvalid_0's ndcg@3: 0.978673\tvalid_0's ndcg@4: 0.978845\tvalid_0's ndcg@5: 0.978933\n",
- "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977324\tvalid_0's ndcg@3: 0.978624\tvalid_0's ndcg@4: 0.978796\tvalid_0's ndcg@5: 0.978893\n",
- "[84]\tvalid_0's ndcg@1: 0.94405\tvalid_0's ndcg@2: 0.977253\tvalid_0's ndcg@3: 0.978565\tvalid_0's ndcg@4: 0.978737\tvalid_0's ndcg@5: 0.978834\n",
- "[85]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977283\tvalid_0's ndcg@3: 0.978633\tvalid_0's ndcg@4: 0.978795\tvalid_0's ndcg@5: 0.978882\n",
- "[86]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.97745\tvalid_0's ndcg@3: 0.978763\tvalid_0's ndcg@4: 0.978924\tvalid_0's ndcg@5: 0.979011\n",
- "[87]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.977419\tvalid_0's ndcg@3: 0.978756\tvalid_0's ndcg@4: 0.978918\tvalid_0's ndcg@5: 0.979005\n",
- "[88]\tvalid_0's ndcg@1: 0.944825\tvalid_0's ndcg@2: 0.977554\tvalid_0's ndcg@3: 0.978867\tvalid_0's ndcg@4: 0.979039\tvalid_0's ndcg@5: 0.979126\n",
- "[89]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977767\tvalid_0's ndcg@3: 0.979079\tvalid_0's ndcg@4: 0.979262\tvalid_0's ndcg@5: 0.97934\n",
- "[90]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977773\tvalid_0's ndcg@3: 0.979073\tvalid_0's ndcg@4: 0.979256\tvalid_0's ndcg@5: 0.979334\n",
- "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977792\tvalid_0's ndcg@3: 0.979092\tvalid_0's ndcg@4: 0.979275\tvalid_0's ndcg@5: 0.979352\n",
- "[92]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977776\tvalid_0's ndcg@3: 0.979088\tvalid_0's ndcg@4: 0.979261\tvalid_0's ndcg@5: 0.979348\n",
- "[93]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977757\tvalid_0's ndcg@3: 0.979082\tvalid_0's ndcg@4: 0.979244\tvalid_0's ndcg@5: 0.979331\n",
- "[94]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977761\tvalid_0's ndcg@3: 0.979061\tvalid_0's ndcg@4: 0.979223\tvalid_0's ndcg@5: 0.97931\n",
- "[95]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977798\tvalid_0's ndcg@3: 0.979086\tvalid_0's ndcg@4: 0.979258\tvalid_0's ndcg@5: 0.979345\n",
- "[96]\tvalid_0's ndcg@1: 0.945825\tvalid_0's ndcg@2: 0.977955\tvalid_0's ndcg@3: 0.97923\tvalid_0's ndcg@4: 0.979413\tvalid_0's ndcg@5: 0.9795\n",
- "[97]\tvalid_0's ndcg@1: 0.945925\tvalid_0's ndcg@2: 0.97796\tvalid_0's ndcg@3: 0.97926\tvalid_0's ndcg@4: 0.979443\tvalid_0's ndcg@5: 0.979531\n",
- "[98]\tvalid_0's ndcg@1: 0.9464\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.97942\tvalid_0's ndcg@4: 0.979625\tvalid_0's ndcg@5: 0.979702\n",
- "[99]\tvalid_0's ndcg@1: 0.94655\tvalid_0's ndcg@2: 0.978191\tvalid_0's ndcg@3: 0.979479\tvalid_0's ndcg@4: 0.979683\tvalid_0's ndcg@5: 0.97977\n",
- "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
- "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.963031\tvalid_0's ndcg@3: 0.965281\tvalid_0's ndcg@4: 0.965819\tvalid_0's ndcg@5: 0.965887\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's ndcg@1: 0.9141\tvalid_0's ndcg@2: 0.964748\tvalid_0's ndcg@3: 0.96681\tvalid_0's ndcg@4: 0.967316\tvalid_0's ndcg@5: 0.967394\n",
- "[3]\tvalid_0's ndcg@1: 0.915925\tvalid_0's ndcg@2: 0.9655\tvalid_0's ndcg@3: 0.967575\tvalid_0's ndcg@4: 0.968028\tvalid_0's ndcg@5: 0.968105\n",
- "[4]\tvalid_0's ndcg@1: 0.91915\tvalid_0's ndcg@2: 0.966943\tvalid_0's ndcg@3: 0.968968\tvalid_0's ndcg@4: 0.969334\tvalid_0's ndcg@5: 0.969373\n",
- "[5]\tvalid_0's ndcg@1: 0.920625\tvalid_0's ndcg@2: 0.967598\tvalid_0's ndcg@3: 0.969498\tvalid_0's ndcg@4: 0.969896\tvalid_0's ndcg@5: 0.969944\n",
- "[6]\tvalid_0's ndcg@1: 0.922625\tvalid_0's ndcg@2: 0.968336\tvalid_0's ndcg@3: 0.970261\tvalid_0's ndcg@4: 0.970659\tvalid_0's ndcg@5: 0.970688\n",
- "[7]\tvalid_0's ndcg@1: 0.923625\tvalid_0's ndcg@2: 0.968768\tvalid_0's ndcg@3: 0.970656\tvalid_0's ndcg@4: 0.971043\tvalid_0's ndcg@5: 0.971072\n",
- "[8]\tvalid_0's ndcg@1: 0.925825\tvalid_0's ndcg@2: 0.969612\tvalid_0's ndcg@3: 0.971462\tvalid_0's ndcg@4: 0.97186\tvalid_0's ndcg@5: 0.971879\n",
- "[9]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.969899\tvalid_0's ndcg@3: 0.971711\tvalid_0's ndcg@4: 0.97211\tvalid_0's ndcg@5: 0.972129\n",
- "[10]\tvalid_0's ndcg@1: 0.927775\tvalid_0's ndcg@2: 0.97041\tvalid_0's ndcg@3: 0.972185\tvalid_0's ndcg@4: 0.972594\tvalid_0's ndcg@5: 0.972614\n",
- "[11]\tvalid_0's ndcg@1: 0.92885\tvalid_0's ndcg@2: 0.970838\tvalid_0's ndcg@3: 0.972588\tvalid_0's ndcg@4: 0.973008\tvalid_0's ndcg@5: 0.973028\n",
- "[12]\tvalid_0's ndcg@1: 0.930325\tvalid_0's ndcg@2: 0.971367\tvalid_0's ndcg@3: 0.973129\tvalid_0's ndcg@4: 0.973549\tvalid_0's ndcg@5: 0.973569\n",
- "[13]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971631\tvalid_0's ndcg@3: 0.973443\tvalid_0's ndcg@4: 0.973842\tvalid_0's ndcg@5: 0.973871\n",
- "[14]\tvalid_0's ndcg@1: 0.931525\tvalid_0's ndcg@2: 0.971778\tvalid_0's ndcg@3: 0.973616\tvalid_0's ndcg@4: 0.973993\tvalid_0's ndcg@5: 0.974022\n",
- "[15]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.9717\tvalid_0's ndcg@3: 0.973475\tvalid_0's ndcg@4: 0.973852\tvalid_0's ndcg@5: 0.973872\n",
- "[16]\tvalid_0's ndcg@1: 0.931775\tvalid_0's ndcg@2: 0.971902\tvalid_0's ndcg@3: 0.973702\tvalid_0's ndcg@4: 0.97409\tvalid_0's ndcg@5: 0.974109\n",
- "[17]\tvalid_0's ndcg@1: 0.931425\tvalid_0's ndcg@2: 0.971805\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973967\tvalid_0's ndcg@5: 0.973986\n",
- "[18]\tvalid_0's ndcg@1: 0.931575\tvalid_0's ndcg@2: 0.971876\tvalid_0's ndcg@3: 0.973651\tvalid_0's ndcg@4: 0.974027\tvalid_0's ndcg@5: 0.974047\n",
- "[19]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.97208\tvalid_0's ndcg@3: 0.973805\tvalid_0's ndcg@4: 0.974192\tvalid_0's ndcg@5: 0.974212\n",
- "[20]\tvalid_0's ndcg@1: 0.932075\tvalid_0's ndcg@2: 0.972092\tvalid_0's ndcg@3: 0.973829\tvalid_0's ndcg@4: 0.974217\tvalid_0's ndcg@5: 0.974236\n",
- "[21]\tvalid_0's ndcg@1: 0.932675\tvalid_0's ndcg@2: 0.972282\tvalid_0's ndcg@3: 0.974057\tvalid_0's ndcg@4: 0.974444\tvalid_0's ndcg@5: 0.974454\n",
- "[22]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972358\tvalid_0's ndcg@3: 0.974146\tvalid_0's ndcg@4: 0.974533\tvalid_0's ndcg@5: 0.974543\n",
- "[23]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972478\tvalid_0's ndcg@3: 0.974253\tvalid_0's ndcg@4: 0.974651\tvalid_0's ndcg@5: 0.974661\n",
- "[24]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972539\tvalid_0's ndcg@3: 0.974351\tvalid_0's ndcg@4: 0.974739\tvalid_0's ndcg@5: 0.974749\n",
- "[25]\tvalid_0's ndcg@1: 0.93475\tvalid_0's ndcg@2: 0.973\tvalid_0's ndcg@3: 0.974788\tvalid_0's ndcg@4: 0.975197\tvalid_0's ndcg@5: 0.975206\n",
- "[26]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.97312\tvalid_0's ndcg@3: 0.974895\tvalid_0's ndcg@4: 0.975315\tvalid_0's ndcg@5: 0.975325\n",
- "[27]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.974865\tvalid_0's ndcg@4: 0.975264\tvalid_0's ndcg@5: 0.975273\n",
- "[28]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974939\tvalid_0's ndcg@4: 0.975327\tvalid_0's ndcg@5: 0.975336\n",
- "[29]\tvalid_0's ndcg@1: 0.935475\tvalid_0's ndcg@2: 0.973315\tvalid_0's ndcg@3: 0.975128\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975492\n",
- "[30]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973522\tvalid_0's ndcg@3: 0.975297\tvalid_0's ndcg@4: 0.975663\tvalid_0's ndcg@5: 0.975673\n",
- "[31]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973506\tvalid_0's ndcg@3: 0.975281\tvalid_0's ndcg@4: 0.975658\tvalid_0's ndcg@5: 0.975668\n",
- "[32]\tvalid_0's ndcg@1: 0.93675\tvalid_0's ndcg@2: 0.973833\tvalid_0's ndcg@3: 0.975595\tvalid_0's ndcg@4: 0.975961\tvalid_0's ndcg@5: 0.975971\n",
- "[33]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.973763\tvalid_0's ndcg@3: 0.975488\tvalid_0's ndcg@4: 0.975865\tvalid_0's ndcg@5: 0.975874\n",
- "[34]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973893\tvalid_0's ndcg@3: 0.975568\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975966\n"
- ]
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## LGB Ranking Model"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[35]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974059\tvalid_0's ndcg@3: 0.975722\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.97613\n",
- "[36]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
- "[37]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
- "[38]\tvalid_0's ndcg@1: 0.938725\tvalid_0's ndcg@2: 0.974672\tvalid_0's ndcg@3: 0.97636\tvalid_0's ndcg@4: 0.976715\tvalid_0's ndcg@5: 0.976725\n",
- "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974676\tvalid_0's ndcg@3: 0.976364\tvalid_0's ndcg@4: 0.976697\tvalid_0's ndcg@5: 0.976707\n",
- "[40]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.974867\tvalid_0's ndcg@3: 0.97653\tvalid_0's ndcg@4: 0.976874\tvalid_0's ndcg@5: 0.976884\n",
- "[41]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975042\tvalid_0's ndcg@3: 0.976705\tvalid_0's ndcg@4: 0.97705\tvalid_0's ndcg@5: 0.977059\n",
- "[42]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976784\tvalid_0's ndcg@4: 0.977129\tvalid_0's ndcg@5: 0.977138\n",
- "[43]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.97517\tvalid_0's ndcg@3: 0.97687\tvalid_0's ndcg@4: 0.977215\tvalid_0's ndcg@5: 0.977225\n",
- "[44]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.97534\tvalid_0's ndcg@3: 0.977015\tvalid_0's ndcg@4: 0.97736\tvalid_0's ndcg@5: 0.97737\n",
- "[45]\tvalid_0's ndcg@1: 0.94055\tvalid_0's ndcg@2: 0.975409\tvalid_0's ndcg@3: 0.977059\tvalid_0's ndcg@4: 0.977403\tvalid_0's ndcg@5: 0.977413\n",
- "[46]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975415\tvalid_0's ndcg@3: 0.97704\tvalid_0's ndcg@4: 0.977396\tvalid_0's ndcg@5: 0.977405\n",
- "[47]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975363\tvalid_0's ndcg@3: 0.977013\tvalid_0's ndcg@4: 0.977357\tvalid_0's ndcg@5: 0.977367\n",
- "[48]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975388\tvalid_0's ndcg@3: 0.977025\tvalid_0's ndcg@4: 0.97737\tvalid_0's ndcg@5: 0.977379\n",
- "[49]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975447\tvalid_0's ndcg@3: 0.977097\tvalid_0's ndcg@4: 0.977409\tvalid_0's ndcg@5: 0.977419\n",
- "[50]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975666\tvalid_0's ndcg@3: 0.977303\tvalid_0's ndcg@4: 0.977615\tvalid_0's ndcg@5: 0.977625\n",
- "[51]\tvalid_0's ndcg@1: 0.94135\tvalid_0's ndcg@2: 0.975751\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.97771\tvalid_0's ndcg@5: 0.97772\n",
- "[52]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.975717\tvalid_0's ndcg@3: 0.977355\tvalid_0's ndcg@4: 0.977688\tvalid_0's ndcg@5: 0.977698\n",
- "[53]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.975713\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977699\tvalid_0's ndcg@5: 0.977718\n",
- "[54]\tvalid_0's ndcg@1: 0.94185\tvalid_0's ndcg@2: 0.975857\tvalid_0's ndcg@3: 0.977557\tvalid_0's ndcg@4: 0.977869\tvalid_0's ndcg@5: 0.977889\n",
- "[55]\tvalid_0's ndcg@1: 0.941925\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.9776\tvalid_0's ndcg@4: 0.977891\tvalid_0's ndcg@5: 0.97791\n",
- "[56]\tvalid_0's ndcg@1: 0.942325\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977719\tvalid_0's ndcg@4: 0.978032\tvalid_0's ndcg@5: 0.978051\n",
- "[57]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977772\tvalid_0's ndcg@4: 0.978073\tvalid_0's ndcg@5: 0.978093\n",
- "[58]\tvalid_0's ndcg@1: 0.9425\tvalid_0's ndcg@2: 0.976081\tvalid_0's ndcg@3: 0.977806\tvalid_0's ndcg@4: 0.978108\tvalid_0's ndcg@5: 0.978127\n",
- "[59]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977788\tvalid_0's ndcg@4: 0.978079\tvalid_0's ndcg@5: 0.978098\n",
- "[60]\tvalid_0's ndcg@1: 0.942375\tvalid_0's ndcg@2: 0.976067\tvalid_0's ndcg@3: 0.977779\tvalid_0's ndcg@4: 0.97807\tvalid_0's ndcg@5: 0.978089\n",
- "[61]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976043\tvalid_0's ndcg@3: 0.97773\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.97804\n",
- "[62]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976117\tvalid_0's ndcg@3: 0.977792\tvalid_0's ndcg@4: 0.978093\tvalid_0's ndcg@5: 0.978112\n",
- "[63]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977881\tvalid_0's ndcg@4: 0.978182\tvalid_0's ndcg@5: 0.978201\n",
- "[64]\tvalid_0's ndcg@1: 0.942925\tvalid_0's ndcg@2: 0.976254\tvalid_0's ndcg@3: 0.977966\tvalid_0's ndcg@4: 0.978268\tvalid_0's ndcg@5: 0.978287\n",
- "[65]\tvalid_0's ndcg@1: 0.9431\tvalid_0's ndcg@2: 0.97635\tvalid_0's ndcg@3: 0.978025\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978357\n",
- "[66]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976445\tvalid_0's ndcg@3: 0.978132\tvalid_0's ndcg@4: 0.978445\tvalid_0's ndcg@5: 0.978464\n",
- "[67]\tvalid_0's ndcg@1: 0.943275\tvalid_0's ndcg@2: 0.976399\tvalid_0's ndcg@3: 0.978074\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978416\n",
- "[68]\tvalid_0's ndcg@1: 0.943325\tvalid_0's ndcg@2: 0.976401\tvalid_0's ndcg@3: 0.978089\tvalid_0's ndcg@4: 0.978412\tvalid_0's ndcg@5: 0.978431\n",
- "[69]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976578\tvalid_0's ndcg@3: 0.97819\tvalid_0's ndcg@4: 0.978546\tvalid_0's ndcg@5: 0.978565\n",
- "[70]\tvalid_0's ndcg@1: 0.944025\tvalid_0's ndcg@2: 0.976707\tvalid_0's ndcg@3: 0.97832\tvalid_0's ndcg@4: 0.978675\tvalid_0's ndcg@5: 0.978694\n",
- "[71]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976772\tvalid_0's ndcg@3: 0.978384\tvalid_0's ndcg@4: 0.97874\tvalid_0's ndcg@5: 0.978759\n",
- "[72]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978409\tvalid_0's ndcg@4: 0.978765\tvalid_0's ndcg@5: 0.978784\n",
- "[73]\tvalid_0's ndcg@1: 0.94445\tvalid_0's ndcg@2: 0.976864\tvalid_0's ndcg@3: 0.978464\tvalid_0's ndcg@4: 0.97883\tvalid_0's ndcg@5: 0.978849\n",
- "[74]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
- "[75]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
- "[76]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.97696\tvalid_0's ndcg@3: 0.978535\tvalid_0's ndcg@4: 0.978901\tvalid_0's ndcg@5: 0.978921\n",
- "[77]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
- "[78]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
- "[79]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976907\tvalid_0's ndcg@3: 0.978507\tvalid_0's ndcg@4: 0.978863\tvalid_0's ndcg@5: 0.978882\n",
- "[80]\tvalid_0's ndcg@1: 0.94455\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97851\tvalid_0's ndcg@4: 0.978865\tvalid_0's ndcg@5: 0.978885\n",
- "[81]\tvalid_0's ndcg@1: 0.944725\tvalid_0's ndcg@2: 0.97695\tvalid_0's ndcg@3: 0.978575\tvalid_0's ndcg@4: 0.978919\tvalid_0's ndcg@5: 0.978948\n",
- "[82]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.977103\tvalid_0's ndcg@3: 0.978765\tvalid_0's ndcg@4: 0.97911\tvalid_0's ndcg@5: 0.979129\n",
- "[83]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977066\tvalid_0's ndcg@3: 0.978716\tvalid_0's ndcg@4: 0.979071\tvalid_0's ndcg@5: 0.97909\n",
- "[84]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.97715\tvalid_0's ndcg@3: 0.978775\tvalid_0's ndcg@4: 0.97912\tvalid_0's ndcg@5: 0.979139\n",
- "[85]\tvalid_0's ndcg@1: 0.945025\tvalid_0's ndcg@2: 0.977092\tvalid_0's ndcg@3: 0.978692\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979067\n",
- "[86]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977172\tvalid_0's ndcg@3: 0.97876\tvalid_0's ndcg@4: 0.979115\tvalid_0's ndcg@5: 0.979135\n",
- "[87]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.97879\tvalid_0's ndcg@4: 0.979156\tvalid_0's ndcg@5: 0.979166\n",
- "[88]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.978815\tvalid_0's ndcg@4: 0.979149\tvalid_0's ndcg@5: 0.979168\n",
- "[89]\tvalid_0's ndcg@1: 0.94555\tvalid_0's ndcg@2: 0.977333\tvalid_0's ndcg@3: 0.978933\tvalid_0's ndcg@4: 0.979267\tvalid_0's ndcg@5: 0.979277\n",
- "[90]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977462\tvalid_0's ndcg@3: 0.979062\tvalid_0's ndcg@4: 0.979396\tvalid_0's ndcg@5: 0.979406\n",
- "[91]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977481\tvalid_0's ndcg@3: 0.979081\tvalid_0's ndcg@4: 0.979414\tvalid_0's ndcg@5: 0.979424\n",
- "[92]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977437\tvalid_0's ndcg@3: 0.97905\tvalid_0's ndcg@4: 0.979384\tvalid_0's ndcg@5: 0.979393\n",
- "[93]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977421\tvalid_0's ndcg@3: 0.979046\tvalid_0's ndcg@4: 0.97938\tvalid_0's ndcg@5: 0.97939\n",
- "[94]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977431\tvalid_0's ndcg@3: 0.979068\tvalid_0's ndcg@4: 0.979391\tvalid_0's ndcg@5: 0.979401\n",
- "[95]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977449\tvalid_0's ndcg@3: 0.979074\tvalid_0's ndcg@4: 0.979408\tvalid_0's ndcg@5: 0.979418\n",
- "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979127\tvalid_0's ndcg@4: 0.979461\tvalid_0's ndcg@5: 0.97947\n"
- ]
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:07.787698Z",
+ "start_time": "2020-11-18T04:21:07.536514Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Work on copies so the data need not be re-read if a later step fails\n",
+ "trn_user_item_feats_df_rank_model = trn_user_item_feats_df.copy()\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_rank_model = val_user_item_feats_df.copy()\n",
+ " \n",
+ "tst_user_item_feats_df_rank_model = tst_user_item_feats_df.copy()"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[97]\tvalid_0's ndcg@1: 0.946375\tvalid_0's ndcg@2: 0.977622\tvalid_0's ndcg@3: 0.979222\tvalid_0's ndcg@4: 0.979577\tvalid_0's ndcg@5: 0.979577\n",
- "[98]\tvalid_0's ndcg@1: 0.946625\tvalid_0's ndcg@2: 0.977714\tvalid_0's ndcg@3: 0.979339\tvalid_0's ndcg@4: 0.979673\tvalid_0's ndcg@5: 0.979673\n",
- "[99]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.977739\tvalid_0's ndcg@3: 0.979352\tvalid_0's ndcg@4: 0.979685\tvalid_0's ndcg@5: 0.979685\n",
- "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n"
- ]
- }
- ],
- "source": [
-    "# Five-fold cross-validation; the folds are split by user\n",
-    "# This part is separate from the single train/validation run above\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_rank_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id','label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
-    "# Five-fold cross-validation; save the intermediate results for stacking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
-    "    # Per-user group sizes for the training and validation sets\n",
- " train_idx.sort_values(by=['user_id'], inplace=True)\n",
- " g_train = train_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- " \n",
- " valid_idx.sort_values(by=['user_id'], inplace=True)\n",
- " g_val = valid_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
- " \n",
-    "    # Define the model\n",
- " lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) \n",
-    "    # Train the model\n",
- " lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,\n",
- " eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group= [g_val], \n",
- " eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
- " \n",
-    "    # Predict on the validation set\n",
- " valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
- " \n",
-    "    # Normalize the predicted scores\n",
- " valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
-    "    # Collect the validation predictions in a list to concatenate later\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
-    "    # For online testing, sum the test predictions from each fold and average them at the end\n",
- " if not offline:\n",
- " sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], lgb_ranker.best_iteration_)\n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
-    "# Save the new features produced by cross-validation on the training set\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)\n",
- " \n",
-    "# Test-set predictions averaged over the folds; save the predicted score and its rank as features for later stacking (more features could be built here)\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
-    "# Save the new cross-validation features for the test set\n",
- "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 14,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:52.604397Z",
- "start_time": "2020-11-18T04:22:43.253034Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
-    "# Submission from a single model\n",
- "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_ranker')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
-    "## LGB Classification Model"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 15,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:22:58.259730Z",
- "start_time": "2020-11-18T04:22:58.254297Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Define the model and its parameters\n",
- "lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 16,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:11.258774Z",
- "start_time": "2020-11-18T04:23:00.861936Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Train the model\n",
- "if offline:\n",
- " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],\n",
- " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
- " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
- "else:\n",
- " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 17,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:19.591396Z",
- "start_time": "2020-11-18T04:23:13.813850Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Predict with the model\n",
- "tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:,1]\n",
- "\n",
-    "# Save a copy of these ranking scores for later model fusion\n",
- "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 18,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:23:32.352931Z",
- "start_time": "2020-11-18T04:23:22.346609Z"
- }
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
- "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_cls')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": 19,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:11.241196Z",
- "start_time": "2020-11-18T04:23:41.377394Z"
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:10.839656Z",
+ "start_time": "2020-11-18T04:21:10.833109Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the feature columns\n",
+ "lgb_cols = ['sim0', 'time_diff0', 'word_diff0','sim_max', 'sim_min', 'sim_sum', \n",
+ " 'sim_mean', 'score','click_size', 'time_diff_mean', 'active_level',\n",
+ " 'click_environment','click_deviceGroup', 'click_os', 'click_country', \n",
+ " 'click_region','click_referrer_type', 'user_time_hob1', 'user_time_hob2',\n",
+ " 'words_hbo', 'category_id', 'created_at_ts','words_count']"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[1]\tvalid_0's auc: 0.764896\tvalid_0's binary_logloss: 0.522153\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.767857\tvalid_0's binary_logloss: 0.52057\n",
- "[3]\tvalid_0's auc: 0.783096\tvalid_0's binary_logloss: 0.519584\n",
- "[4]\tvalid_0's auc: 0.784354\tvalid_0's binary_logloss: 0.518485\n",
- "[5]\tvalid_0's auc: 0.790554\tvalid_0's binary_logloss: 0.516886\n",
- "[6]\tvalid_0's auc: 0.791954\tvalid_0's binary_logloss: 0.515334\n",
- "[7]\tvalid_0's auc: 0.794257\tvalid_0's binary_logloss: 0.514032\n",
- "[8]\tvalid_0's auc: 0.795222\tvalid_0's binary_logloss: 0.512516\n",
- "[9]\tvalid_0's auc: 0.795417\tvalid_0's binary_logloss: 0.511671\n",
- "[10]\tvalid_0's auc: 0.795913\tvalid_0's binary_logloss: 0.510226\n",
- "[11]\tvalid_0's auc: 0.798222\tvalid_0's binary_logloss: 0.508858\n",
- "[12]\tvalid_0's auc: 0.79825\tvalid_0's binary_logloss: 0.507928\n",
- "[13]\tvalid_0's auc: 0.798842\tvalid_0's binary_logloss: 0.50708\n",
- "[14]\tvalid_0's auc: 0.798935\tvalid_0's binary_logloss: 0.505752\n",
- "[15]\tvalid_0's auc: 0.799543\tvalid_0's binary_logloss: 0.504388\n",
- "[16]\tvalid_0's auc: 0.800844\tvalid_0's binary_logloss: 0.503126\n",
- "[17]\tvalid_0's auc: 0.800855\tvalid_0's binary_logloss: 0.501809\n",
- "[18]\tvalid_0's auc: 0.801653\tvalid_0's binary_logloss: 0.500676\n",
- "[19]\tvalid_0's auc: 0.801518\tvalid_0's binary_logloss: 0.49987\n",
- "[20]\tvalid_0's auc: 0.801662\tvalid_0's binary_logloss: 0.498625\n",
- "[21]\tvalid_0's auc: 0.802093\tvalid_0's binary_logloss: 0.498113\n",
- "[22]\tvalid_0's auc: 0.803071\tvalid_0's binary_logloss: 0.496933\n",
- "[23]\tvalid_0's auc: 0.803222\tvalid_0's binary_logloss: 0.495864\n",
- "[24]\tvalid_0's auc: 0.802927\tvalid_0's binary_logloss: 0.494691\n",
- "[25]\tvalid_0's auc: 0.802581\tvalid_0's binary_logloss: 0.493543\n",
- "[26]\tvalid_0's auc: 0.802965\tvalid_0's binary_logloss: 0.492444\n",
- "[27]\tvalid_0's auc: 0.80298\tvalid_0's binary_logloss: 0.491336\n",
- "[28]\tvalid_0's auc: 0.803226\tvalid_0's binary_logloss: 0.490275\n",
- "[29]\tvalid_0's auc: 0.803436\tvalid_0's binary_logloss: 0.489126\n",
- "[30]\tvalid_0's auc: 0.803796\tvalid_0's binary_logloss: 0.48802\n",
- "[31]\tvalid_0's auc: 0.803601\tvalid_0's binary_logloss: 0.486988\n",
- "[32]\tvalid_0's auc: 0.804416\tvalid_0's binary_logloss: 0.485972\n",
- "[33]\tvalid_0's auc: 0.804529\tvalid_0's binary_logloss: 0.484939\n",
- "[34]\tvalid_0's auc: 0.804534\tvalid_0's binary_logloss: 0.483927\n",
- "[35]\tvalid_0's auc: 0.804819\tvalid_0's binary_logloss: 0.483271\n",
- "[36]\tvalid_0's auc: 0.804774\tvalid_0's binary_logloss: 0.482273\n",
- "[37]\tvalid_0's auc: 0.805237\tvalid_0's binary_logloss: 0.481639\n",
- "[38]\tvalid_0's auc: 0.805546\tvalid_0's binary_logloss: 0.480959\n",
- "[39]\tvalid_0's auc: 0.805598\tvalid_0's binary_logloss: 0.479955\n",
- "[40]\tvalid_0's auc: 0.806011\tvalid_0's binary_logloss: 0.47903\n",
- "[41]\tvalid_0's auc: 0.806664\tvalid_0's binary_logloss: 0.478439\n",
- "[42]\tvalid_0's auc: 0.807021\tvalid_0's binary_logloss: 0.477798\n",
- "[43]\tvalid_0's auc: 0.80726\tvalid_0's binary_logloss: 0.476829\n",
- "[44]\tvalid_0's auc: 0.807157\tvalid_0's binary_logloss: 0.475976\n",
- "[45]\tvalid_0's auc: 0.807788\tvalid_0's binary_logloss: 0.475056\n",
- "[46]\tvalid_0's auc: 0.80805\tvalid_0's binary_logloss: 0.474446\n",
- "[47]\tvalid_0's auc: 0.808097\tvalid_0's binary_logloss: 0.473576\n",
- "[48]\tvalid_0's auc: 0.80815\tvalid_0's binary_logloss: 0.472676\n",
- "[49]\tvalid_0's auc: 0.808304\tvalid_0's binary_logloss: 0.471918\n",
- "[50]\tvalid_0's auc: 0.808749\tvalid_0's binary_logloss: 0.471481\n",
- "[51]\tvalid_0's auc: 0.808972\tvalid_0's binary_logloss: 0.471104\n",
- "[52]\tvalid_0's auc: 0.809326\tvalid_0's binary_logloss: 0.470289\n",
- "[53]\tvalid_0's auc: 0.809472\tvalid_0's binary_logloss: 0.469508\n",
- "[54]\tvalid_0's auc: 0.809505\tvalid_0's binary_logloss: 0.46869\n",
- "[55]\tvalid_0's auc: 0.809594\tvalid_0's binary_logloss: 0.467885\n",
- "[56]\tvalid_0's auc: 0.809847\tvalid_0's binary_logloss: 0.467356\n",
- "[57]\tvalid_0's auc: 0.810262\tvalid_0's binary_logloss: 0.466531\n",
- "[58]\tvalid_0's auc: 0.810407\tvalid_0's binary_logloss: 0.46573\n",
- "[59]\tvalid_0's auc: 0.810618\tvalid_0's binary_logloss: 0.465205\n",
- "[60]\tvalid_0's auc: 0.81066\tvalid_0's binary_logloss: 0.464435\n",
- "[61]\tvalid_0's auc: 0.810638\tvalid_0's binary_logloss: 0.463721\n",
- "[62]\tvalid_0's auc: 0.810658\tvalid_0's binary_logloss: 0.462982\n",
- "[63]\tvalid_0's auc: 0.811106\tvalid_0's binary_logloss: 0.462246\n",
- "[64]\tvalid_0's auc: 0.811313\tvalid_0's binary_logloss: 0.461748\n",
- "[65]\tvalid_0's auc: 0.811351\tvalid_0's binary_logloss: 0.461038\n",
- "[66]\tvalid_0's auc: 0.811433\tvalid_0's binary_logloss: 0.460323\n",
- "[67]\tvalid_0's auc: 0.81158\tvalid_0's binary_logloss: 0.459662\n",
- "[68]\tvalid_0's auc: 0.811561\tvalid_0's binary_logloss: 0.458988\n",
- "[69]\tvalid_0's auc: 0.811748\tvalid_0's binary_logloss: 0.458592\n",
- "[70]\tvalid_0's auc: 0.811919\tvalid_0's binary_logloss: 0.457934\n",
- "[71]\tvalid_0's auc: 0.812073\tvalid_0's binary_logloss: 0.457508\n",
- "[72]\tvalid_0's auc: 0.812273\tvalid_0's binary_logloss: 0.457038\n",
- "[73]\tvalid_0's auc: 0.812561\tvalid_0's binary_logloss: 0.456439\n",
- "[74]\tvalid_0's auc: 0.812633\tvalid_0's binary_logloss: 0.455789\n",
- "[75]\tvalid_0's auc: 0.812757\tvalid_0's binary_logloss: 0.455173\n",
- "[76]\tvalid_0's auc: 0.812923\tvalid_0's binary_logloss: 0.454533\n",
- "[77]\tvalid_0's auc: 0.81295\tvalid_0's binary_logloss: 0.45392\n",
- "[78]\tvalid_0's auc: 0.813073\tvalid_0's binary_logloss: 0.453517\n",
- "[79]\tvalid_0's auc: 0.813202\tvalid_0's binary_logloss: 0.452932\n",
- "[80]\tvalid_0's auc: 0.813611\tvalid_0's binary_logloss: 0.452285\n",
- "[81]\tvalid_0's auc: 0.813769\tvalid_0's binary_logloss: 0.45191\n",
- "[82]\tvalid_0's auc: 0.814468\tvalid_0's binary_logloss: 0.451455\n",
- "[83]\tvalid_0's auc: 0.814656\tvalid_0's binary_logloss: 0.450885\n",
- "[84]\tvalid_0's auc: 0.814755\tvalid_0's binary_logloss: 0.450308\n",
- "[85]\tvalid_0's auc: 0.814824\tvalid_0's binary_logloss: 0.449739\n",
- "[86]\tvalid_0's auc: 0.81499\tvalid_0's binary_logloss: 0.449348\n",
- "[87]\tvalid_0's auc: 0.815232\tvalid_0's binary_logloss: 0.448759\n",
- "[88]\tvalid_0's auc: 0.815452\tvalid_0's binary_logloss: 0.44823\n",
- "[89]\tvalid_0's auc: 0.815593\tvalid_0's binary_logloss: 0.447861\n",
- "[90]\tvalid_0's auc: 0.815591\tvalid_0's binary_logloss: 0.447323\n",
- "[91]\tvalid_0's auc: 0.815672\tvalid_0's binary_logloss: 0.446796\n",
- "[92]\tvalid_0's auc: 0.815875\tvalid_0's binary_logloss: 0.446472\n",
- "[93]\tvalid_0's auc: 0.815984\tvalid_0's binary_logloss: 0.445961\n",
- "[94]\tvalid_0's auc: 0.816026\tvalid_0's binary_logloss: 0.445439\n",
- "[95]\tvalid_0's auc: 0.816172\tvalid_0's binary_logloss: 0.444909\n",
- "[96]\tvalid_0's auc: 0.816321\tvalid_0's binary_logloss: 0.444413\n",
- "[97]\tvalid_0's auc: 0.816751\tvalid_0's binary_logloss: 0.44405\n",
- "[98]\tvalid_0's auc: 0.817226\tvalid_0's binary_logloss: 0.443626\n",
- "[99]\tvalid_0's auc: 0.817286\tvalid_0's binary_logloss: 0.443136\n",
- "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
- "[1]\tvalid_0's auc: 0.771584\tvalid_0's binary_logloss: 0.527139\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.775446\tvalid_0's binary_logloss: 0.525462\n",
- "[3]\tvalid_0's auc: 0.790092\tvalid_0's binary_logloss: 0.524461\n",
- "[4]\tvalid_0's auc: 0.791432\tvalid_0's binary_logloss: 0.523322\n",
- "[5]\tvalid_0's auc: 0.797482\tvalid_0's binary_logloss: 0.521614\n",
- "[6]\tvalid_0's auc: 0.79893\tvalid_0's binary_logloss: 0.520007\n",
- "[7]\tvalid_0's auc: 0.800753\tvalid_0's binary_logloss: 0.5187\n",
- "[8]\tvalid_0's auc: 0.802197\tvalid_0's binary_logloss: 0.517125\n",
- "[9]\tvalid_0's auc: 0.802828\tvalid_0's binary_logloss: 0.516269\n",
- "[10]\tvalid_0's auc: 0.803496\tvalid_0's binary_logloss: 0.51474\n",
- "[11]\tvalid_0's auc: 0.804972\tvalid_0's binary_logloss: 0.513321\n",
- "[12]\tvalid_0's auc: 0.804995\tvalid_0's binary_logloss: 0.512334\n",
- "[13]\tvalid_0's auc: 0.80525\tvalid_0's binary_logloss: 0.51151\n",
- "[14]\tvalid_0's auc: 0.805026\tvalid_0's binary_logloss: 0.510149\n",
- "[15]\tvalid_0's auc: 0.805622\tvalid_0's binary_logloss: 0.508708\n",
- "[16]\tvalid_0's auc: 0.806974\tvalid_0's binary_logloss: 0.507384\n",
- "[17]\tvalid_0's auc: 0.807045\tvalid_0's binary_logloss: 0.506017\n",
- "[18]\tvalid_0's auc: 0.807265\tvalid_0's binary_logloss: 0.504853\n",
- "[19]\tvalid_0's auc: 0.807126\tvalid_0's binary_logloss: 0.503972\n",
- "[20]\tvalid_0's auc: 0.806948\tvalid_0's binary_logloss: 0.502693\n",
- "[21]\tvalid_0's auc: 0.807315\tvalid_0's binary_logloss: 0.502166\n",
- "[22]\tvalid_0's auc: 0.808067\tvalid_0's binary_logloss: 0.500948\n",
- "[23]\tvalid_0's auc: 0.808226\tvalid_0's binary_logloss: 0.49987\n",
- "[24]\tvalid_0's auc: 0.808268\tvalid_0's binary_logloss: 0.498623\n",
- "[25]\tvalid_0's auc: 0.808569\tvalid_0's binary_logloss: 0.497389\n",
- "[26]\tvalid_0's auc: 0.809069\tvalid_0's binary_logloss: 0.49624\n",
- "[27]\tvalid_0's auc: 0.809312\tvalid_0's binary_logloss: 0.495095\n",
- "[28]\tvalid_0's auc: 0.809549\tvalid_0's binary_logloss: 0.494012\n",
- "[29]\tvalid_0's auc: 0.809944\tvalid_0's binary_logloss: 0.492834\n",
- "[30]\tvalid_0's auc: 0.810047\tvalid_0's binary_logloss: 0.491735\n",
- "[31]\tvalid_0's auc: 0.810086\tvalid_0's binary_logloss: 0.490633\n"
- ]
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:14.126608Z",
+ "start_time": "2020-11-18T04:21:13.493653Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Compute per-user group sizes for the ranking model (rows must be sorted by user_id)\n",
+ "trn_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
+ "g_train = trn_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_rank_model.sort_values(by=['user_id'], inplace=True)\n",
+ " g_val = val_user_item_feats_df_rank_model.groupby(['user_id'], as_index=False).count()[\"label\"].values"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[32]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.489595\n",
- "[33]\tvalid_0's auc: 0.810539\tvalid_0's binary_logloss: 0.488536\n",
- "[34]\tvalid_0's auc: 0.810529\tvalid_0's binary_logloss: 0.487489\n",
- "[35]\tvalid_0's auc: 0.810932\tvalid_0's binary_logloss: 0.486775\n",
- "[36]\tvalid_0's auc: 0.810769\tvalid_0's binary_logloss: 0.48577\n",
- "[37]\tvalid_0's auc: 0.811363\tvalid_0's binary_logloss: 0.485123\n",
- "[38]\tvalid_0's auc: 0.811801\tvalid_0's binary_logloss: 0.484413\n",
- "[39]\tvalid_0's auc: 0.811987\tvalid_0's binary_logloss: 0.483371\n",
- "[40]\tvalid_0's auc: 0.812268\tvalid_0's binary_logloss: 0.482407\n",
- "[41]\tvalid_0's auc: 0.813297\tvalid_0's binary_logloss: 0.481742\n",
- "[42]\tvalid_0's auc: 0.813453\tvalid_0's binary_logloss: 0.481108\n",
- "[43]\tvalid_0's auc: 0.813603\tvalid_0's binary_logloss: 0.480163\n",
- "[44]\tvalid_0's auc: 0.813654\tvalid_0's binary_logloss: 0.479239\n",
- "[45]\tvalid_0's auc: 0.814267\tvalid_0's binary_logloss: 0.478299\n",
- "[46]\tvalid_0's auc: 0.81455\tvalid_0's binary_logloss: 0.477678\n",
- "[47]\tvalid_0's auc: 0.81452\tvalid_0's binary_logloss: 0.476766\n",
- "[48]\tvalid_0's auc: 0.814925\tvalid_0's binary_logloss: 0.475815\n",
- "[49]\tvalid_0's auc: 0.814907\tvalid_0's binary_logloss: 0.47503\n",
- "[50]\tvalid_0's auc: 0.815278\tvalid_0's binary_logloss: 0.474588\n",
- "[51]\tvalid_0's auc: 0.815535\tvalid_0's binary_logloss: 0.474171\n",
- "[52]\tvalid_0's auc: 0.815685\tvalid_0's binary_logloss: 0.473335\n",
- "[53]\tvalid_0's auc: 0.815787\tvalid_0's binary_logloss: 0.472509\n",
- "[54]\tvalid_0's auc: 0.815827\tvalid_0's binary_logloss: 0.471686\n",
- "[55]\tvalid_0's auc: 0.815871\tvalid_0's binary_logloss: 0.470838\n",
- "[56]\tvalid_0's auc: 0.816238\tvalid_0's binary_logloss: 0.470285\n",
- "[57]\tvalid_0's auc: 0.816269\tvalid_0's binary_logloss: 0.469495\n",
- "[58]\tvalid_0's auc: 0.816528\tvalid_0's binary_logloss: 0.468654\n",
- "[59]\tvalid_0's auc: 0.816706\tvalid_0's binary_logloss: 0.468122\n",
- "[60]\tvalid_0's auc: 0.816821\tvalid_0's binary_logloss: 0.467352\n",
- "[61]\tvalid_0's auc: 0.816759\tvalid_0's binary_logloss: 0.466622\n",
- "[62]\tvalid_0's auc: 0.81682\tvalid_0's binary_logloss: 0.465867\n",
- "[63]\tvalid_0's auc: 0.817251\tvalid_0's binary_logloss: 0.465112\n",
- "[64]\tvalid_0's auc: 0.817476\tvalid_0's binary_logloss: 0.464589\n",
- "[65]\tvalid_0's auc: 0.817613\tvalid_0's binary_logloss: 0.463831\n",
- "[66]\tvalid_0's auc: 0.817648\tvalid_0's binary_logloss: 0.463098\n",
- "[67]\tvalid_0's auc: 0.817719\tvalid_0's binary_logloss: 0.462414\n",
- "[68]\tvalid_0's auc: 0.817814\tvalid_0's binary_logloss: 0.461727\n",
- "[69]\tvalid_0's auc: 0.817973\tvalid_0's binary_logloss: 0.461329\n",
- "[70]\tvalid_0's auc: 0.818108\tvalid_0's binary_logloss: 0.460674\n",
- "[71]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.460222\n",
- "[72]\tvalid_0's auc: 0.818456\tvalid_0's binary_logloss: 0.45977\n",
- "[73]\tvalid_0's auc: 0.818727\tvalid_0's binary_logloss: 0.459157\n",
- "[74]\tvalid_0's auc: 0.818988\tvalid_0's binary_logloss: 0.458437\n",
- "[75]\tvalid_0's auc: 0.819144\tvalid_0's binary_logloss: 0.457808\n",
- "[76]\tvalid_0's auc: 0.819259\tvalid_0's binary_logloss: 0.457159\n",
- "[77]\tvalid_0's auc: 0.819343\tvalid_0's binary_logloss: 0.456512\n",
- "[78]\tvalid_0's auc: 0.81954\tvalid_0's binary_logloss: 0.456045\n",
- "[79]\tvalid_0's auc: 0.819687\tvalid_0's binary_logloss: 0.455416\n",
- "[80]\tvalid_0's auc: 0.819958\tvalid_0's binary_logloss: 0.454765\n",
- "[81]\tvalid_0's auc: 0.820115\tvalid_0's binary_logloss: 0.45436\n",
- "[82]\tvalid_0's auc: 0.820536\tvalid_0's binary_logloss: 0.453965\n",
- "[83]\tvalid_0's auc: 0.820649\tvalid_0's binary_logloss: 0.453383\n",
- "[84]\tvalid_0's auc: 0.820663\tvalid_0's binary_logloss: 0.452804\n",
- "[85]\tvalid_0's auc: 0.820809\tvalid_0's binary_logloss: 0.452167\n",
- "[86]\tvalid_0's auc: 0.821024\tvalid_0's binary_logloss: 0.451735\n",
- "[87]\tvalid_0's auc: 0.821124\tvalid_0's binary_logloss: 0.451167\n",
- "[88]\tvalid_0's auc: 0.821243\tvalid_0's binary_logloss: 0.45061\n",
- "[89]\tvalid_0's auc: 0.821404\tvalid_0's binary_logloss: 0.450215\n",
- "[90]\tvalid_0's auc: 0.821488\tvalid_0's binary_logloss: 0.449656\n",
- "[91]\tvalid_0's auc: 0.821538\tvalid_0's binary_logloss: 0.449107\n",
- "[92]\tvalid_0's auc: 0.82172\tvalid_0's binary_logloss: 0.448752\n",
- "[93]\tvalid_0's auc: 0.821809\tvalid_0's binary_logloss: 0.448188\n",
- "[94]\tvalid_0's auc: 0.82184\tvalid_0's binary_logloss: 0.447659\n",
- "[95]\tvalid_0's auc: 0.821971\tvalid_0's binary_logloss: 0.447108\n",
- "[96]\tvalid_0's auc: 0.822086\tvalid_0's binary_logloss: 0.446596\n",
- "[97]\tvalid_0's auc: 0.82247\tvalid_0's binary_logloss: 0.446244\n",
- "[98]\tvalid_0's auc: 0.822951\tvalid_0's binary_logloss: 0.445812\n",
- "[99]\tvalid_0's auc: 0.822991\tvalid_0's binary_logloss: 0.445329\n",
- "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
- "[1]\tvalid_0's auc: 0.769525\tvalid_0's binary_logloss: 0.526256\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.775857\tvalid_0's binary_logloss: 0.524594\n",
- "[3]\tvalid_0's auc: 0.785307\tvalid_0's binary_logloss: 0.523606\n",
- "[4]\tvalid_0's auc: 0.786356\tvalid_0's binary_logloss: 0.522495\n",
- "[5]\tvalid_0's auc: 0.793385\tvalid_0's binary_logloss: 0.520812\n",
- "[6]\tvalid_0's auc: 0.794014\tvalid_0's binary_logloss: 0.519253\n",
- "[7]\tvalid_0's auc: 0.795454\tvalid_0's binary_logloss: 0.517961\n",
- "[8]\tvalid_0's auc: 0.79807\tvalid_0's binary_logloss: 0.516363\n",
- "[9]\tvalid_0's auc: 0.798756\tvalid_0's binary_logloss: 0.51548\n",
- "[10]\tvalid_0's auc: 0.798314\tvalid_0's binary_logloss: 0.514021\n",
- "[11]\tvalid_0's auc: 0.799343\tvalid_0's binary_logloss: 0.512678\n",
- "[12]\tvalid_0's auc: 0.799573\tvalid_0's binary_logloss: 0.511708\n",
- "[13]\tvalid_0's auc: 0.799563\tvalid_0's binary_logloss: 0.510892\n",
- "[14]\tvalid_0's auc: 0.800333\tvalid_0's binary_logloss: 0.509532\n",
- "[15]\tvalid_0's auc: 0.800672\tvalid_0's binary_logloss: 0.508117\n",
- "[16]\tvalid_0's auc: 0.801953\tvalid_0's binary_logloss: 0.506866\n",
- "[17]\tvalid_0's auc: 0.802078\tvalid_0's binary_logloss: 0.5055\n",
- "[18]\tvalid_0's auc: 0.802449\tvalid_0's binary_logloss: 0.504358\n",
- "[19]\tvalid_0's auc: 0.802329\tvalid_0's binary_logloss: 0.503503\n",
- "[20]\tvalid_0's auc: 0.802437\tvalid_0's binary_logloss: 0.502233\n",
- "[21]\tvalid_0's auc: 0.802653\tvalid_0's binary_logloss: 0.50174\n",
- "[22]\tvalid_0's auc: 0.803753\tvalid_0's binary_logloss: 0.50056\n",
- "[23]\tvalid_0's auc: 0.803956\tvalid_0's binary_logloss: 0.499496\n",
- "[24]\tvalid_0's auc: 0.804231\tvalid_0's binary_logloss: 0.498283\n",
- "[25]\tvalid_0's auc: 0.804554\tvalid_0's binary_logloss: 0.497059\n",
- "[26]\tvalid_0's auc: 0.805133\tvalid_0's binary_logloss: 0.495963\n",
- "[27]\tvalid_0's auc: 0.805333\tvalid_0's binary_logloss: 0.494842\n",
- "[28]\tvalid_0's auc: 0.805644\tvalid_0's binary_logloss: 0.493771\n",
- "[29]\tvalid_0's auc: 0.806029\tvalid_0's binary_logloss: 0.492598\n",
- "[30]\tvalid_0's auc: 0.806321\tvalid_0's binary_logloss: 0.491474\n",
- "[31]\tvalid_0's auc: 0.806201\tvalid_0's binary_logloss: 0.490419\n",
- "[32]\tvalid_0's auc: 0.806671\tvalid_0's binary_logloss: 0.489393\n",
- "[33]\tvalid_0's auc: 0.806899\tvalid_0's binary_logloss: 0.488331\n",
- "[34]\tvalid_0's auc: 0.807105\tvalid_0's binary_logloss: 0.487277\n",
- "[35]\tvalid_0's auc: 0.807257\tvalid_0's binary_logloss: 0.486592\n",
- "[36]\tvalid_0's auc: 0.80729\tvalid_0's binary_logloss: 0.485607\n",
- "[37]\tvalid_0's auc: 0.807752\tvalid_0's binary_logloss: 0.484951\n",
- "[38]\tvalid_0's auc: 0.808191\tvalid_0's binary_logloss: 0.484269\n",
- "[39]\tvalid_0's auc: 0.808417\tvalid_0's binary_logloss: 0.483242\n",
- "[40]\tvalid_0's auc: 0.808761\tvalid_0's binary_logloss: 0.482291\n",
- "[41]\tvalid_0's auc: 0.80965\tvalid_0's binary_logloss: 0.48164\n",
- "[42]\tvalid_0's auc: 0.810065\tvalid_0's binary_logloss: 0.480962\n",
- "[43]\tvalid_0's auc: 0.810209\tvalid_0's binary_logloss: 0.479995\n",
- "[44]\tvalid_0's auc: 0.810091\tvalid_0's binary_logloss: 0.479077\n",
- "[45]\tvalid_0's auc: 0.810573\tvalid_0's binary_logloss: 0.478185\n",
- "[46]\tvalid_0's auc: 0.810924\tvalid_0's binary_logloss: 0.477558\n",
- "[47]\tvalid_0's auc: 0.810951\tvalid_0's binary_logloss: 0.476662\n",
- "[48]\tvalid_0's auc: 0.811101\tvalid_0's binary_logloss: 0.475745\n",
- "[49]\tvalid_0's auc: 0.811269\tvalid_0's binary_logloss: 0.474951\n",
- "[50]\tvalid_0's auc: 0.81173\tvalid_0's binary_logloss: 0.474514\n",
- "[51]\tvalid_0's auc: 0.811937\tvalid_0's binary_logloss: 0.474114\n",
- "[52]\tvalid_0's auc: 0.812136\tvalid_0's binary_logloss: 0.473297\n",
- "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.472497\n",
- "[54]\tvalid_0's auc: 0.812121\tvalid_0's binary_logloss: 0.471696\n",
- "[55]\tvalid_0's auc: 0.812164\tvalid_0's binary_logloss: 0.470905\n",
- "[56]\tvalid_0's auc: 0.812462\tvalid_0's binary_logloss: 0.470384\n",
- "[57]\tvalid_0's auc: 0.812613\tvalid_0's binary_logloss: 0.4696\n",
- "[58]\tvalid_0's auc: 0.812615\tvalid_0's binary_logloss: 0.468778\n",
- "[59]\tvalid_0's auc: 0.812842\tvalid_0's binary_logloss: 0.468211\n",
- "[60]\tvalid_0's auc: 0.81312\tvalid_0's binary_logloss: 0.467385\n",
- "[61]\tvalid_0's auc: 0.813039\tvalid_0's binary_logloss: 0.466632\n",
- "[62]\tvalid_0's auc: 0.812942\tvalid_0's binary_logloss: 0.465933\n",
- "[63]\tvalid_0's auc: 0.813274\tvalid_0's binary_logloss: 0.465214\n",
- "[64]\tvalid_0's auc: 0.813572\tvalid_0's binary_logloss: 0.464692\n",
- "[65]\tvalid_0's auc: 0.813594\tvalid_0's binary_logloss: 0.463925\n",
- "[66]\tvalid_0's auc: 0.813719\tvalid_0's binary_logloss: 0.463177\n",
- "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.462513\n",
- "[68]\tvalid_0's auc: 0.813989\tvalid_0's binary_logloss: 0.461843\n"
- ]
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:16.136151Z",
+ "start_time": "2020-11-18T04:21:16.124444Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Define the ranking model\n",
+ "lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) "
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[69]\tvalid_0's auc: 0.814218\tvalid_0's binary_logloss: 0.461443\n",
- "[70]\tvalid_0's auc: 0.814334\tvalid_0's binary_logloss: 0.460775\n",
- "[71]\tvalid_0's auc: 0.814493\tvalid_0's binary_logloss: 0.460332\n",
- "[72]\tvalid_0's auc: 0.814663\tvalid_0's binary_logloss: 0.459867\n",
- "[73]\tvalid_0's auc: 0.814856\tvalid_0's binary_logloss: 0.459266\n",
- "[74]\tvalid_0's auc: 0.815017\tvalid_0's binary_logloss: 0.458585\n",
- "[75]\tvalid_0's auc: 0.815186\tvalid_0's binary_logloss: 0.457958\n",
- "[76]\tvalid_0's auc: 0.815374\tvalid_0's binary_logloss: 0.457316\n",
- "[77]\tvalid_0's auc: 0.81554\tvalid_0's binary_logloss: 0.45665\n",
- "[78]\tvalid_0's auc: 0.81569\tvalid_0's binary_logloss: 0.456217\n",
- "[79]\tvalid_0's auc: 0.815861\tvalid_0's binary_logloss: 0.455615\n",
- "[80]\tvalid_0's auc: 0.816443\tvalid_0's binary_logloss: 0.454895\n",
- "[81]\tvalid_0's auc: 0.816659\tvalid_0's binary_logloss: 0.454503\n",
- "[82]\tvalid_0's auc: 0.817017\tvalid_0's binary_logloss: 0.454149\n",
- "[83]\tvalid_0's auc: 0.817162\tvalid_0's binary_logloss: 0.453578\n",
- "[84]\tvalid_0's auc: 0.817274\tvalid_0's binary_logloss: 0.452984\n",
- "[85]\tvalid_0's auc: 0.817283\tvalid_0's binary_logloss: 0.452416\n",
- "[86]\tvalid_0's auc: 0.817339\tvalid_0's binary_logloss: 0.452022\n",
- "[87]\tvalid_0's auc: 0.817494\tvalid_0's binary_logloss: 0.45146\n",
- "[88]\tvalid_0's auc: 0.817594\tvalid_0's binary_logloss: 0.450926\n",
- "[89]\tvalid_0's auc: 0.817771\tvalid_0's binary_logloss: 0.450553\n",
- "[90]\tvalid_0's auc: 0.81789\tvalid_0's binary_logloss: 0.449985\n",
- "[91]\tvalid_0's auc: 0.817931\tvalid_0's binary_logloss: 0.449439\n",
- "[92]\tvalid_0's auc: 0.818138\tvalid_0's binary_logloss: 0.449094\n",
- "[93]\tvalid_0's auc: 0.818334\tvalid_0's binary_logloss: 0.448527\n",
- "[94]\tvalid_0's auc: 0.818426\tvalid_0's binary_logloss: 0.447989\n",
- "[95]\tvalid_0's auc: 0.818676\tvalid_0's binary_logloss: 0.447407\n",
- "[96]\tvalid_0's auc: 0.818852\tvalid_0's binary_logloss: 0.446884\n",
- "[97]\tvalid_0's auc: 0.81945\tvalid_0's binary_logloss: 0.446455\n",
- "[98]\tvalid_0's auc: 0.819861\tvalid_0's binary_logloss: 0.446045\n",
- "[99]\tvalid_0's auc: 0.819943\tvalid_0's binary_logloss: 0.445543\n",
- "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
- "[1]\tvalid_0's auc: 0.770032\tvalid_0's binary_logloss: 0.527241\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.779881\tvalid_0's binary_logloss: 0.525545\n",
- "[3]\tvalid_0's auc: 0.791308\tvalid_0's binary_logloss: 0.524508\n",
- "[4]\tvalid_0's auc: 0.790788\tvalid_0's binary_logloss: 0.52341\n",
- "[5]\tvalid_0's auc: 0.795645\tvalid_0's binary_logloss: 0.521753\n",
- "[6]\tvalid_0's auc: 0.797745\tvalid_0's binary_logloss: 0.520131\n",
- "[7]\tvalid_0's auc: 0.79931\tvalid_0's binary_logloss: 0.518872\n",
- "[8]\tvalid_0's auc: 0.800014\tvalid_0's binary_logloss: 0.517353\n",
- "[9]\tvalid_0's auc: 0.800549\tvalid_0's binary_logloss: 0.516487\n",
- "[10]\tvalid_0's auc: 0.800261\tvalid_0's binary_logloss: 0.515039\n",
- "[11]\tvalid_0's auc: 0.801261\tvalid_0's binary_logloss: 0.513695\n",
- "[12]\tvalid_0's auc: 0.801062\tvalid_0's binary_logloss: 0.512735\n",
- "[13]\tvalid_0's auc: 0.801155\tvalid_0's binary_logloss: 0.51192\n",
- "[14]\tvalid_0's auc: 0.801315\tvalid_0's binary_logloss: 0.510559\n",
- "[15]\tvalid_0's auc: 0.80185\tvalid_0's binary_logloss: 0.509147\n",
- "[16]\tvalid_0's auc: 0.803029\tvalid_0's binary_logloss: 0.507914\n",
- "[17]\tvalid_0's auc: 0.803035\tvalid_0's binary_logloss: 0.506583\n",
- "[18]\tvalid_0's auc: 0.803433\tvalid_0's binary_logloss: 0.505441\n",
- "[19]\tvalid_0's auc: 0.803717\tvalid_0's binary_logloss: 0.504599\n",
- "[20]\tvalid_0's auc: 0.803819\tvalid_0's binary_logloss: 0.503327\n",
- "[21]\tvalid_0's auc: 0.803923\tvalid_0's binary_logloss: 0.502782\n",
- "[22]\tvalid_0's auc: 0.804939\tvalid_0's binary_logloss: 0.501596\n",
- "[23]\tvalid_0's auc: 0.804707\tvalid_0's binary_logloss: 0.500572\n",
- "[24]\tvalid_0's auc: 0.804632\tvalid_0's binary_logloss: 0.499367\n",
- "[25]\tvalid_0's auc: 0.804756\tvalid_0's binary_logloss: 0.498161\n",
- "[26]\tvalid_0's auc: 0.805067\tvalid_0's binary_logloss: 0.497061\n",
- "[27]\tvalid_0's auc: 0.805119\tvalid_0's binary_logloss: 0.495933\n",
- "[28]\tvalid_0's auc: 0.805304\tvalid_0's binary_logloss: 0.494849\n",
- "[29]\tvalid_0's auc: 0.805688\tvalid_0's binary_logloss: 0.493677\n",
- "[30]\tvalid_0's auc: 0.805822\tvalid_0's binary_logloss: 0.492594\n",
- "[31]\tvalid_0's auc: 0.805869\tvalid_0's binary_logloss: 0.49152\n",
- "[32]\tvalid_0's auc: 0.807267\tvalid_0's binary_logloss: 0.490435\n",
- "[33]\tvalid_0's auc: 0.807301\tvalid_0's binary_logloss: 0.489392\n",
- "[34]\tvalid_0's auc: 0.80736\tvalid_0's binary_logloss: 0.488325\n",
- "[35]\tvalid_0's auc: 0.807706\tvalid_0's binary_logloss: 0.487654\n",
- "[36]\tvalid_0's auc: 0.807758\tvalid_0's binary_logloss: 0.486651\n",
- "[37]\tvalid_0's auc: 0.808051\tvalid_0's binary_logloss: 0.486012\n",
- "[38]\tvalid_0's auc: 0.808429\tvalid_0's binary_logloss: 0.485355\n",
- "[39]\tvalid_0's auc: 0.808663\tvalid_0's binary_logloss: 0.484327\n",
- "[40]\tvalid_0's auc: 0.809007\tvalid_0's binary_logloss: 0.483386\n",
- "[41]\tvalid_0's auc: 0.809781\tvalid_0's binary_logloss: 0.482745\n",
- "[42]\tvalid_0's auc: 0.810071\tvalid_0's binary_logloss: 0.482124\n",
- "[43]\tvalid_0's auc: 0.810383\tvalid_0's binary_logloss: 0.481154\n",
- "[44]\tvalid_0's auc: 0.810446\tvalid_0's binary_logloss: 0.480243\n",
- "[45]\tvalid_0's auc: 0.811148\tvalid_0's binary_logloss: 0.479261\n",
- "[46]\tvalid_0's auc: 0.811245\tvalid_0's binary_logloss: 0.478687\n",
- "[47]\tvalid_0's auc: 0.811214\tvalid_0's binary_logloss: 0.477812\n",
- "[48]\tvalid_0's auc: 0.811408\tvalid_0's binary_logloss: 0.47689\n",
- "[49]\tvalid_0's auc: 0.811486\tvalid_0's binary_logloss: 0.476132\n",
- "[50]\tvalid_0's auc: 0.811806\tvalid_0's binary_logloss: 0.475718\n",
- "[51]\tvalid_0's auc: 0.812017\tvalid_0's binary_logloss: 0.475342\n",
- "[52]\tvalid_0's auc: 0.812255\tvalid_0's binary_logloss: 0.474505\n",
- "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.473707\n",
- "[54]\tvalid_0's auc: 0.812235\tvalid_0's binary_logloss: 0.47289\n",
- "[55]\tvalid_0's auc: 0.812233\tvalid_0's binary_logloss: 0.472091\n",
- "[56]\tvalid_0's auc: 0.812492\tvalid_0's binary_logloss: 0.471563\n",
- "[57]\tvalid_0's auc: 0.812579\tvalid_0's binary_logloss: 0.47077\n",
- "[58]\tvalid_0's auc: 0.812598\tvalid_0's binary_logloss: 0.469992\n",
- "[59]\tvalid_0's auc: 0.812885\tvalid_0's binary_logloss: 0.469458\n",
- "[60]\tvalid_0's auc: 0.812995\tvalid_0's binary_logloss: 0.468676\n",
- "[61]\tvalid_0's auc: 0.812961\tvalid_0's binary_logloss: 0.467939\n",
- "[62]\tvalid_0's auc: 0.812919\tvalid_0's binary_logloss: 0.467232\n",
- "[63]\tvalid_0's auc: 0.813291\tvalid_0's binary_logloss: 0.466491\n",
- "[64]\tvalid_0's auc: 0.813702\tvalid_0's binary_logloss: 0.465945\n",
- "[65]\tvalid_0's auc: 0.813803\tvalid_0's binary_logloss: 0.465197\n",
- "[66]\tvalid_0's auc: 0.813851\tvalid_0's binary_logloss: 0.4645\n",
- "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.463814\n",
- "[68]\tvalid_0's auc: 0.814027\tvalid_0's binary_logloss: 0.463113\n",
- "[69]\tvalid_0's auc: 0.814138\tvalid_0's binary_logloss: 0.462727\n",
- "[70]\tvalid_0's auc: 0.814365\tvalid_0's binary_logloss: 0.462077\n",
- "[71]\tvalid_0's auc: 0.814432\tvalid_0's binary_logloss: 0.461655\n",
- "[72]\tvalid_0's auc: 0.8146\tvalid_0's binary_logloss: 0.461194\n",
- "[73]\tvalid_0's auc: 0.815324\tvalid_0's binary_logloss: 0.460477\n",
- "[74]\tvalid_0's auc: 0.815411\tvalid_0's binary_logloss: 0.459805\n",
- "[75]\tvalid_0's auc: 0.815548\tvalid_0's binary_logloss: 0.459189\n",
- "[76]\tvalid_0's auc: 0.815625\tvalid_0's binary_logloss: 0.458525\n",
- "[77]\tvalid_0's auc: 0.81562\tvalid_0's binary_logloss: 0.457905\n",
- "[78]\tvalid_0's auc: 0.815786\tvalid_0's binary_logloss: 0.45747\n",
- "[79]\tvalid_0's auc: 0.815834\tvalid_0's binary_logloss: 0.456884\n",
- "[80]\tvalid_0's auc: 0.816475\tvalid_0's binary_logloss: 0.45617\n",
- "[81]\tvalid_0's auc: 0.816677\tvalid_0's binary_logloss: 0.455787\n",
- "[82]\tvalid_0's auc: 0.817255\tvalid_0's binary_logloss: 0.455358\n",
- "[83]\tvalid_0's auc: 0.817383\tvalid_0's binary_logloss: 0.454775\n",
- "[84]\tvalid_0's auc: 0.817509\tvalid_0's binary_logloss: 0.454176\n",
- "[85]\tvalid_0's auc: 0.817572\tvalid_0's binary_logloss: 0.453609\n",
- "[86]\tvalid_0's auc: 0.817721\tvalid_0's binary_logloss: 0.453213\n",
- "[87]\tvalid_0's auc: 0.817992\tvalid_0's binary_logloss: 0.452586\n",
- "[88]\tvalid_0's auc: 0.81808\tvalid_0's binary_logloss: 0.45204\n",
- "[89]\tvalid_0's auc: 0.818202\tvalid_0's binary_logloss: 0.451643\n",
- "[90]\tvalid_0's auc: 0.818336\tvalid_0's binary_logloss: 0.451081\n",
- "[91]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.450531\n",
- "[92]\tvalid_0's auc: 0.818558\tvalid_0's binary_logloss: 0.450179\n",
- "[93]\tvalid_0's auc: 0.818743\tvalid_0's binary_logloss: 0.449647\n",
- "[94]\tvalid_0's auc: 0.818789\tvalid_0's binary_logloss: 0.449133\n",
- "[95]\tvalid_0's auc: 0.818849\tvalid_0's binary_logloss: 0.44862\n",
- "[96]\tvalid_0's auc: 0.81913\tvalid_0's binary_logloss: 0.448072\n",
- "[97]\tvalid_0's auc: 0.819526\tvalid_0's binary_logloss: 0.447713\n",
- "[98]\tvalid_0's auc: 0.819971\tvalid_0's binary_logloss: 0.447296\n",
- "[99]\tvalid_0's auc: 0.819972\tvalid_0's binary_logloss: 0.446814\n"
- ]
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:22.965433Z",
+ "start_time": "2020-11-18T04:21:17.799127Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Train the ranking model\n",
+ "if offline:\n",
+ " lgb_ranker.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'], group=g_train,\n",
+ " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
+ " eval_group= [g_val], eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
+ "else:\n",
+ " lgb_ranker.fit(trn_user_item_feats_df[lgb_cols], trn_user_item_feats_df['label'], group=g_train)"
+ ]
},
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
- "[1]\tvalid_0's auc: 0.768646\tvalid_0's binary_logloss: 0.527167\n",
- "Training until validation scores don't improve for 50 rounds\n",
- "[2]\tvalid_0's auc: 0.779902\tvalid_0's binary_logloss: 0.525481\n",
- "[3]\tvalid_0's auc: 0.789868\tvalid_0's binary_logloss: 0.524485\n",
- "[4]\tvalid_0's auc: 0.791895\tvalid_0's binary_logloss: 0.523382\n",
- "[5]\tvalid_0's auc: 0.795453\tvalid_0's binary_logloss: 0.521759\n",
- "[6]\tvalid_0's auc: 0.796672\tvalid_0's binary_logloss: 0.520166\n",
- "[7]\tvalid_0's auc: 0.798023\tvalid_0's binary_logloss: 0.518857\n",
- "[8]\tvalid_0's auc: 0.799331\tvalid_0's binary_logloss: 0.517297\n",
- "[9]\tvalid_0's auc: 0.800181\tvalid_0's binary_logloss: 0.516416\n",
- "[10]\tvalid_0's auc: 0.800373\tvalid_0's binary_logloss: 0.514967\n",
- "[11]\tvalid_0's auc: 0.801087\tvalid_0's binary_logloss: 0.513631\n",
- "[12]\tvalid_0's auc: 0.801122\tvalid_0's binary_logloss: 0.512658\n",
- "[13]\tvalid_0's auc: 0.801043\tvalid_0's binary_logloss: 0.511833\n",
- "[14]\tvalid_0's auc: 0.801238\tvalid_0's binary_logloss: 0.510461\n",
- "[15]\tvalid_0's auc: 0.801847\tvalid_0's binary_logloss: 0.509034\n",
- "[16]\tvalid_0's auc: 0.803139\tvalid_0's binary_logloss: 0.507759\n",
- "[17]\tvalid_0's auc: 0.803577\tvalid_0's binary_logloss: 0.506361\n",
- "[18]\tvalid_0's auc: 0.803834\tvalid_0's binary_logloss: 0.505229\n",
- "[19]\tvalid_0's auc: 0.803943\tvalid_0's binary_logloss: 0.504371\n",
- "[20]\tvalid_0's auc: 0.80415\tvalid_0's binary_logloss: 0.503102\n",
- "[21]\tvalid_0's auc: 0.804446\tvalid_0's binary_logloss: 0.502564\n",
- "[22]\tvalid_0's auc: 0.805163\tvalid_0's binary_logloss: 0.501396\n",
- "[23]\tvalid_0's auc: 0.805323\tvalid_0's binary_logloss: 0.500327\n",
- "[24]\tvalid_0's auc: 0.805314\tvalid_0's binary_logloss: 0.499123\n",
- "[25]\tvalid_0's auc: 0.80535\tvalid_0's binary_logloss: 0.497927\n",
- "[26]\tvalid_0's auc: 0.805864\tvalid_0's binary_logloss: 0.496834\n",
- "[27]\tvalid_0's auc: 0.805919\tvalid_0's binary_logloss: 0.495667\n",
- "[28]\tvalid_0's auc: 0.806272\tvalid_0's binary_logloss: 0.494606\n",
- "[29]\tvalid_0's auc: 0.806599\tvalid_0's binary_logloss: 0.49343\n",
- "[30]\tvalid_0's auc: 0.806932\tvalid_0's binary_logloss: 0.492303\n",
- "[31]\tvalid_0's auc: 0.806656\tvalid_0's binary_logloss: 0.491249\n",
- "[32]\tvalid_0's auc: 0.807436\tvalid_0's binary_logloss: 0.490188\n",
- "[33]\tvalid_0's auc: 0.807629\tvalid_0's binary_logloss: 0.489117\n",
- "[34]\tvalid_0's auc: 0.807501\tvalid_0's binary_logloss: 0.48808\n",
- "[35]\tvalid_0's auc: 0.807885\tvalid_0's binary_logloss: 0.487383\n",
- "[36]\tvalid_0's auc: 0.807921\tvalid_0's binary_logloss: 0.48636\n",
- "[37]\tvalid_0's auc: 0.808267\tvalid_0's binary_logloss: 0.485724\n",
- "[38]\tvalid_0's auc: 0.808563\tvalid_0's binary_logloss: 0.485076\n",
- "[39]\tvalid_0's auc: 0.808813\tvalid_0's binary_logloss: 0.484039\n",
- "[40]\tvalid_0's auc: 0.809023\tvalid_0's binary_logloss: 0.483091\n",
- "[41]\tvalid_0's auc: 0.809782\tvalid_0's binary_logloss: 0.482441\n",
- "[42]\tvalid_0's auc: 0.810135\tvalid_0's binary_logloss: 0.48179\n",
- "[43]\tvalid_0's auc: 0.810219\tvalid_0's binary_logloss: 0.48082\n",
- "[44]\tvalid_0's auc: 0.81031\tvalid_0's binary_logloss: 0.479906\n",
- "[45]\tvalid_0's auc: 0.810514\tvalid_0's binary_logloss: 0.479024\n",
- "[46]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.478437\n",
- "[47]\tvalid_0's auc: 0.810611\tvalid_0's binary_logloss: 0.477529\n",
- "[48]\tvalid_0's auc: 0.810781\tvalid_0's binary_logloss: 0.476637\n",
- "[49]\tvalid_0's auc: 0.81089\tvalid_0's binary_logloss: 0.475883\n",
- "[50]\tvalid_0's auc: 0.811266\tvalid_0's binary_logloss: 0.475459\n",
- "[51]\tvalid_0's auc: 0.811402\tvalid_0's binary_logloss: 0.475078\n",
- "[52]\tvalid_0's auc: 0.811765\tvalid_0's binary_logloss: 0.474246\n",
- "[53]\tvalid_0's auc: 0.811891\tvalid_0's binary_logloss: 0.473452\n",
- "[54]\tvalid_0's auc: 0.811868\tvalid_0's binary_logloss: 0.47263\n",
- "[55]\tvalid_0's auc: 0.81192\tvalid_0's binary_logloss: 0.471804\n",
- "[56]\tvalid_0's auc: 0.812272\tvalid_0's binary_logloss: 0.471275\n",
- "[57]\tvalid_0's auc: 0.812639\tvalid_0's binary_logloss: 0.470396\n",
- "[58]\tvalid_0's auc: 0.812764\tvalid_0's binary_logloss: 0.469597\n",
- "[59]\tvalid_0's auc: 0.813084\tvalid_0's binary_logloss: 0.469049\n",
- "[60]\tvalid_0's auc: 0.813342\tvalid_0's binary_logloss: 0.468244\n",
- "[61]\tvalid_0's auc: 0.813302\tvalid_0's binary_logloss: 0.467499\n",
- "[62]\tvalid_0's auc: 0.813221\tvalid_0's binary_logloss: 0.466758\n",
- "[63]\tvalid_0's auc: 0.813697\tvalid_0's binary_logloss: 0.466017\n",
- "[64]\tvalid_0's auc: 0.813985\tvalid_0's binary_logloss: 0.465501\n",
- "[65]\tvalid_0's auc: 0.81416\tvalid_0's binary_logloss: 0.464725\n",
- "[66]\tvalid_0's auc: 0.814227\tvalid_0's binary_logloss: 0.46398\n",
- "[67]\tvalid_0's auc: 0.814397\tvalid_0's binary_logloss: 0.463309\n",
- "[68]\tvalid_0's auc: 0.814426\tvalid_0's binary_logloss: 0.462627\n",
- "[69]\tvalid_0's auc: 0.814593\tvalid_0's binary_logloss: 0.462244\n",
- "[70]\tvalid_0's auc: 0.814789\tvalid_0's binary_logloss: 0.461571\n",
- "[71]\tvalid_0's auc: 0.814889\tvalid_0's binary_logloss: 0.461144\n",
- "[72]\tvalid_0's auc: 0.815078\tvalid_0's binary_logloss: 0.460684\n",
- "[73]\tvalid_0's auc: 0.815439\tvalid_0's binary_logloss: 0.460063\n",
- "[74]\tvalid_0's auc: 0.815511\tvalid_0's binary_logloss: 0.459386\n",
- "[75]\tvalid_0's auc: 0.815574\tvalid_0's binary_logloss: 0.45877\n",
- "[76]\tvalid_0's auc: 0.815634\tvalid_0's binary_logloss: 0.458128\n",
- "[77]\tvalid_0's auc: 0.815618\tvalid_0's binary_logloss: 0.457495\n",
- "[78]\tvalid_0's auc: 0.81582\tvalid_0's binary_logloss: 0.457057\n",
- "[79]\tvalid_0's auc: 0.81594\tvalid_0's binary_logloss: 0.456475\n",
- "[80]\tvalid_0's auc: 0.815961\tvalid_0's binary_logloss: 0.455885\n",
- "[81]\tvalid_0's auc: 0.816153\tvalid_0's binary_logloss: 0.455511\n",
- "[82]\tvalid_0's auc: 0.816433\tvalid_0's binary_logloss: 0.455186\n",
- "[83]\tvalid_0's auc: 0.816546\tvalid_0's binary_logloss: 0.454625\n",
- "[84]\tvalid_0's auc: 0.816586\tvalid_0's binary_logloss: 0.454039\n",
- "[85]\tvalid_0's auc: 0.816584\tvalid_0's binary_logloss: 0.453482\n",
- "[86]\tvalid_0's auc: 0.816881\tvalid_0's binary_logloss: 0.453048\n",
- "[87]\tvalid_0's auc: 0.817029\tvalid_0's binary_logloss: 0.452485\n",
- "[88]\tvalid_0's auc: 0.81707\tvalid_0's binary_logloss: 0.451941\n",
- "[89]\tvalid_0's auc: 0.817298\tvalid_0's binary_logloss: 0.451544\n",
- "[90]\tvalid_0's auc: 0.817343\tvalid_0's binary_logloss: 0.450975\n",
- "[91]\tvalid_0's auc: 0.817357\tvalid_0's binary_logloss: 0.450422\n",
- "[92]\tvalid_0's auc: 0.817592\tvalid_0's binary_logloss: 0.450109\n",
- "[93]\tvalid_0's auc: 0.817729\tvalid_0's binary_logloss: 0.449542\n",
- "[94]\tvalid_0's auc: 0.817834\tvalid_0's binary_logloss: 0.448982\n",
- "[95]\tvalid_0's auc: 0.81809\tvalid_0's binary_logloss: 0.448398\n",
- "[96]\tvalid_0's auc: 0.818269\tvalid_0's binary_logloss: 0.447908\n",
- "[97]\tvalid_0's auc: 0.818682\tvalid_0's binary_logloss: 0.447547\n",
- "[98]\tvalid_0's auc: 0.819015\tvalid_0's binary_logloss: 0.447165\n",
- "[99]\tvalid_0's auc: 0.819016\tvalid_0's binary_logloss: 0.446669\n",
- "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n",
- "Did not meet early stopping. Best iteration is:\n",
- "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n"
- ]
- }
- ],
- "source": [
- "# 五折交叉验证,这里的五折交叉是以用户为目标进行五折划分\n",
- "# 这一部分与前面的单独训练和验证是分开的\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_rank_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
- "# 五折交叉验证,并将中间结果保存用于staking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
- " # 模型及参数的定义\n",
- " lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
- " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
- " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) \n",
- " # 训练模型\n",
- " lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'],eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], \n",
- " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
- " \n",
- " # 预测验证集结果\n",
- " valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols], \n",
- " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
- " \n",
- " # 对输出结果进行归一化 分类模型输出的值本身就是一个概率值不需要进行归一化\n",
- " # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 将验证集的预测结果放到一个列表中,后面进行拼接\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
- " # 如果是线上测试,需要计算每次交叉验证的结果相加,最后求平均\n",
- " if not offline:\n",
- " sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols], \n",
- " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
- "# 保存训练集交叉验证产生的新特征\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)\n",
- " \n",
- "# 测试集的预测结果,多次交叉验证求平均,将预测的score和对应的rank特征保存,可以用于后面的staking,这里还可以构造其他更多的特征\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
- "# 保存测试集交叉验证的新特征\n",
- "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 20,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:23.074237Z",
- "start_time": "2020-11-18T04:24:13.812284Z"
- }
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
- "submit(rank_results, topk=5, model_name='lgb_cls')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## DIN模型"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### 用户的历史点击行为列表\n",
- "这个是为后面的DIN模型服务的"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 21,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:24:30.508213Z",
- "start_time": "2020-11-18T04:24:27.426372Z"
- }
- },
- "outputs": [],
- "source": [
- "if offline:\n",
- " all_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
- "else:\n",
- " trn_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
- " tst_data = pd.read_csv('./data_raw/testA_click_log.csv')\n",
- " all_data = trn_data.append(tst_data)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 22,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:25:28.082071Z",
- "start_time": "2020-11-18T04:24:33.649524Z"
- }
- },
- "outputs": [],
- "source": [
- "hist_click =all_data[['user_id', 'click_article_id']].groupby('user_id').agg({list}).reset_index()\n",
- "his_behavior_df = pd.DataFrame()\n",
- "his_behavior_df['user_id'] = hist_click['user_id']\n",
- "his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 23,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:25:52.925866Z",
- "start_time": "2020-11-18T04:25:52.863922Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_din_model = val_user_item_feats_df.copy()\n",
- "else: \n",
- " val_user_item_feats_df_din_model = None\n",
- " \n",
- "tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 24,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:00.070681Z",
- "start_time": "2020-11-18T04:25:56.417197Z"
- }
- },
- "outputs": [],
- "source": [
- "trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
- "\n",
- "if offline:\n",
- " val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
- "else:\n",
- " val_user_item_feats_df_din_model = None\n",
- "\n",
- "tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### DIN模型简介\n",
- "我们下面尝试使用DIN模型, DIN的全称是Deep Interest Network, 这是阿里2018年基于前面的深度学习模型无法表达用户多样化的兴趣而提出的一个模型, 它可以通过考虑【给定的候选广告】和【用户的历史行为】的相关性,来计算用户兴趣的表示向量。具体来说就是通过引入局部激活单元,通过软搜索历史行为的相关部分来关注相关的用户兴趣,并采用加权和来获得有关候选广告的用户兴趣的表示。与候选广告相关性较高的行为会获得较高的激活权重,并支配着用户兴趣。该表示向量在不同广告上有所不同,大大提高了模型的表达能力。所以该模型对于此次新闻推荐的任务也比较适合, 我们在这里通过当前的候选文章与用户历史点击文章的相关性来计算用户对于文章的兴趣。 该模型的结构如下:\n",
- "\n",
- "![image-20201116201646983](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)\n",
- "\n",
- "\n",
- "我们这里直接调包来使用这个模型, 关于这个模型的详细细节部分我们会在下一期的推荐系统组队学习中给出。下面说一下该模型如何具体使用:deepctr的函数原型如下:\n",
- "> def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,\n",
- "> dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation=\"dice\",\n",
- "> att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,\n",
- "> task='binary'):\n",
- "> \n",
- "> * dnn_feature_columns: 特征列, 包含数据所有特征的列表\n",
- "> * history_feature_list: 用户历史行为列, 反应用户历史行为的特征的列表\n",
- "> * dnn_use_bn: 是否使用BatchNormalization\n",
- "> * dnn_hidden_units: 全连接层网络的层数和每一层神经元的个数, 一个列表或者元组\n",
- "> * dnn_activation_relu: 全连接网络的激活单元类型\n",
- "> * att_hidden_size: 注意力层的全连接网络的层数和每一层神经元的个数\n",
- "> * att_activation: 注意力层的激活单元类型\n",
- "> * att_weight_normalization: 是否归一化注意力得分\n",
- "> * l2_reg_dnn: 全连接网络的正则化系数\n",
- "> * l2_reg_embedding: embedding向量的正则化稀疏\n",
- "> * dnn_dropout: 全连接网络的神经元的失活概率\n",
- "> * task: 任务, 可以是分类, 也可是是回归\n",
- "\n",
- "在具体使用的时候, 我们必须要传入特征列和历史行为列, 但是再传入之前, 我们需要进行一下特征列的预处理。具体如下:\n",
- "\n",
- "1. 首先,我们要处理数据集, 得到数据, 由于我们是基于用户过去的行为去预测用户是否点击当前文章, 所以我们需要把数据的特征列划分成数值型特征, 离散型特征和历史行为特征列三部分, 对于每一部分, DIN模型的处理会有不同\n",
- " 1. 对于离散型特征, 在我们的数据集中就是那些类别型的特征, 比如user_id这种, 这种类别型特征, 我们首先要经过embedding处理得到每个特征的低维稠密型表示, 既然要经过embedding, 那么我们就需要为每一列的类别特征的取值建立一个字典,并指明embedding维度, 所以在使用deepctr的DIN模型准备数据的时候, 我们需要通过SparseFeat函数指明这些类别型特征, 这个函数的传入参数就是列名, 列的唯一取值(建立字典用)和embedding维度。\n",
- " 2. 对于用户历史行为特征列, 比如文章id, 文章的类别等这种, 同样的我们需要先经过embedding处理, 只不过和上面不一样的地方是,对于这种特征, 我们在得到每个特征的embedding表示之后, 还需要通过一个Attention_layer计算用户的历史行为和当前候选文章的相关性以此得到当前用户的embedding向量, 这个向量就可以基于当前的候选文章与用户过去点击过得历史文章的相似性的程度来反应用户的兴趣, 并且随着用户的不同的历史点击来变化,去动态的模拟用户兴趣的变化过程。这类特征对于每个用户都是一个历史行为序列, 对于每个用户, 历史行为序列长度会不一样, 可能有的用户点击的历史文章多,有的点击的历史文章少, 所以我们还需要把这个长度统一起来, 在为DIN模型准备数据的时候, 我们首先要通过SparseFeat函数指明这些类别型特征, 然后还需要通过VarLenSparseFeat函数再进行序列填充, 使得每个用户的历史序列一样长, 所以这个函数参数中会有个maxlen,来指明序列的最大长度是多少。\n",
- " 3. 对于连续型特征列, 我们只需要用DenseFeat函数来指明列名和维度即可。\n",
- "2. 处理完特征列之后, 我们把相应的数据与列进行对应,就得到了最后的数据。\n",
- "\n",
- "下面根据具体的代码感受一下, 逻辑是这样, 首先我们需要写一个数据准备函数, 在这里面就是根据上面的具体步骤准备数据, 得到数据和特征列, 然后就是建立DIN模型并训练, 最后基于模型进行测试。"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 25,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:08.405211Z",
- "start_time": "2020-11-18T04:26:04.887013Z"
- }
- },
- "outputs": [],
- "source": [
- "# 导入deepctr\n",
- "from deepctr.models import DIN\n",
- "from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names\n",
- "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
- "\n",
- "from tensorflow.keras import backend as K\n",
- "from tensorflow.keras.layers import *\n",
- "from tensorflow.keras.models import *\n",
- "from tensorflow.keras.callbacks import * \n",
- "import tensorflow as tf\n",
- "\n",
- "import os\n",
- "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
- "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\""
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 26,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:13.485712Z",
- "start_time": "2020-11-18T04:26:13.476042Z"
- }
- },
- "outputs": [],
- "source": [
- "# 数据准备函数\n",
- "def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, his_behavior_fea, emb_dim=32, max_len=100):\n",
- " \"\"\"\n",
- " 数据准备函数:\n",
- " df: 数据集\n",
- " dense_fea: 数值型特征列\n",
- " sparse_fea: 离散型特征列\n",
- " behavior_fea: 用户的候选行为特征列\n",
- " his_behavior_fea: 用户的历史行为特征列\n",
- " embedding_dim: embedding的维度, 这里为了简单, 统一把离散型特征列采用一样的隐向量维度\n",
- " max_len: 用户序列的最大长度\n",
- " \"\"\"\n",
- " \n",
- " sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]\n",
- " \n",
- " dense_feature_columns = [DenseFeat(feat, 1, ) for feat in dense_fea]\n",
- " \n",
- " var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,\n",
- " embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in hist_behavior_fea]\n",
- " \n",
- " dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns\n",
- " \n",
- " # 建立x, x是一个字典的形式\n",
- " x = {}\n",
- " for name in get_feature_names(dnn_feature_columns):\n",
- " if name in his_behavior_fea:\n",
- " # 这是历史行为序列\n",
- " his_list = [l for l in df[name]]\n",
- " x[name] = pad_sequences(his_list, maxlen=max_len, padding='post') # 二维数组\n",
- " else:\n",
- " x[name] = df[name].values\n",
- " \n",
- " return x, dnn_feature_columns"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 27,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:18.783217Z",
- "start_time": "2020-11-18T04:26:18.776795Z"
- }
- },
- "outputs": [],
- "source": [
- "# 把特征分开\n",
- "sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup', \n",
- " 'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']\n",
- "\n",
- "behavior_fea = ['click_article_id']\n",
- "\n",
- "hist_behavior_fea = ['hist_click_article_id']\n",
- "\n",
- "dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',\n",
- " 'rank','click_size','time_diff_mean','active_level','user_time_hob1','user_time_hob2',\n",
- " 'words_hbo','words_count']"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 28,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:25.469810Z",
- "start_time": "2020-11-18T04:26:24.779347Z"
- }
- },
- "outputs": [],
- "source": [
- "# dense特征进行归一化, 神经网络训练都需要将数值进行归一化处理\n",
- "mm = MinMaxScaler()\n",
- "\n",
- "# 下面是做一些特殊处理,当在其他的地方出现无效值的时候,不处理无法进行归一化,刚开始可以先把他注释掉,在运行了下面的代码\n",
- "# 之后如果发现报错,应该先去想办法处理如何不出现inf之类的值\n",
- "# trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
- "# tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
- "\n",
- "for feat in dense_fea:\n",
- " trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])\n",
- " \n",
- " if val_user_item_feats_df_din_model is not None:\n",
- " val_user_item_feats_df_din_model[feat] = mm.fit_transform(val_user_item_feats_df_din_model[[feat]])\n",
- " \n",
- " tst_user_item_feats_df_din_model[feat] = mm.fit_transform(tst_user_item_feats_df_din_model[[feat]])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 29,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:36.727753Z",
- "start_time": "2020-11-18T04:26:28.854705Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:28.616665Z",
+ "start_time": "2020-11-18T04:21:24.672280Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Model prediction\n",
+ "tst_user_item_feats_df['pred_score'] = lgb_ranker.predict(tst_user_item_feats_df[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ "\n",
+ "# Save a copy of these ranking scores for later model fusion\n",
+ "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_ranker_score.csv', index=False)"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n"
- ]
- }
- ],
- "source": [
- "# 准备训练数据\n",
- "x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- "y_trn = trn_user_item_feats_df_din_model['label'].values\n",
- "\n",
- "if offline:\n",
- " # 准备验证数据\n",
- " x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- " y_val = val_user_item_feats_df_din_model['label'].values\n",
- " \n",
- "dense_fea = [x for x in dense_fea if x != 'label']\n",
- "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 30,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:26:45.146318Z",
- "start_time": "2020-11-18T04:26:40.423914Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:21:40.253692Z",
+ "start_time": "2020-11-18T04:21:30.546587Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [],
+ "source": [
+ "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_ranker')"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
- "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:255: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
- "Instructions for updating:\n",
- "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
- "Model: \"model\"\n",
- "__________________________________________________________________________________________________\n",
- "Layer (type) Output Shape Param # Connected to \n",
- "==================================================================================================\n",
- "user_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_article_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "category_id (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_environment (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_deviceGroup (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_os (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_country (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_region (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_referrer_type (InputLayer [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "is_cat_hab (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_user_id (Embedding) (None, 1, 32) 1600032 user_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_seq_emb_hist_click_artic multiple 525664 click_article_id[0][0] \n",
- " hist_click_article_id[0][0] \n",
- " click_article_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_category_id (Embeddi (None, 1, 32) 7776 category_id[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_environment (E (None, 1, 32) 128 click_environment[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_deviceGroup (E (None, 1, 32) 160 click_deviceGroup[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_os (Embedding) (None, 1, 32) 288 click_os[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 64 is_cat_hab[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0] \n",
- " sparse_seq_emb_hist_click_article\n",
- " sparse_emb_category_id[0][0] \n",
- " sparse_emb_click_environment[0][0\n",
- " sparse_emb_click_deviceGroup[0][0\n",
- " sparse_emb_click_os[0][0] \n",
- " sparse_emb_click_country[0][0] \n",
- " sparse_emb_click_region[0][0] \n",
- " sparse_emb_click_referrer_type[0]\n",
- " sparse_emb_is_cat_hab[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "hist_click_article_id (InputLay [(None, 50)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0] \n",
- " no_mask[1][0] \n",
- " no_mask[2][0] \n",
- " no_mask[3][0] \n",
- " no_mask[4][0] \n",
- " no_mask[5][0] \n",
- " no_mask[6][0] \n",
- " no_mask[7][0] \n",
- " no_mask[8][0] \n",
- " no_mask[9][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article\n",
- " sparse_seq_emb_hist_click_article\n",
- "__________________________________________________________________________________________________\n",
- "concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0] \n",
- " attention_sequence_pooling_layer[\n",
- "__________________________________________________________________________________________________\n",
- "sim0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "time_diff0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "word_diff0 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_max (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_min (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_sum (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "sim_mean (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "score (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "rank (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "click_size (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "time_diff_mean (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "active_level (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "user_time_hob1 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "user_time_hob2 (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "words_hbo (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "words_count (InputLayer) [(None, 1)] 0 \n",
- "__________________________________________________________________________________________________\n",
- "flatten (Flatten) (None, 352) 0 concatenate_1[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_3 (NoMask) (None, 1) 0 sim0[0][0] \n",
- " time_diff0[0][0] \n",
- " word_diff0[0][0] \n",
- " sim_max[0][0] \n",
- " sim_min[0][0] \n",
- " sim_sum[0][0] \n",
- " sim_mean[0][0] \n",
- " score[0][0] \n",
- " rank[0][0] \n",
- " click_size[0][0] \n",
- " time_diff_mean[0][0] \n",
- " active_level[0][0] \n",
- " user_time_hob1[0][0] \n",
- " user_time_hob2[0][0] \n",
- " words_hbo[0][0] \n",
- " words_count[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_2 (NoMask) (None, 352) 0 flatten[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0] \n",
- " no_mask_3[1][0] \n",
- " no_mask_3[2][0] \n",
- " no_mask_3[3][0] \n",
- " no_mask_3[4][0] \n",
- " no_mask_3[5][0] \n",
- " no_mask_3[6][0] \n",
- " no_mask_3[7][0] \n",
- " no_mask_3[8][0] \n",
- " no_mask_3[9][0] \n",
- " no_mask_3[10][0] \n",
- " no_mask_3[11][0] \n",
- " no_mask_3[12][0] \n",
- " no_mask_3[13][0] \n",
- " no_mask_3[14][0] \n",
- " no_mask_3[15][0] \n",
- "__________________________________________________________________________________________________\n",
- "flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "no_mask_4 (NoMask) multiple 0 flatten_1[0][0] \n",
- " flatten_2[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0] \n",
- " no_mask_4[1][0] \n",
- "__________________________________________________________________________________________________\n",
- "dnn_1 (DNN) (None, 80) 89880 concatenate_3[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "dense (Dense) (None, 1) 80 dnn_1[0][0] \n",
- "__________________________________________________________________________________________________\n",
- "prediction_layer (PredictionLay (None, 1) 1 dense[0][0] \n",
- "==================================================================================================\n",
- "Total params: 2,239,602\n",
- "Trainable params: 2,239,362\n",
- "Non-trainable params: 240\n",
- "__________________________________________________________________________________________________\n"
- ]
- }
- ],
- "source": [
-    "# Build the model\n",
- "model = DIN(dnn_feature_columns, behavior_fea)\n",
- "\n",
-    "# Inspect the model structure\n",
- "model.summary()\n",
- "\n",
-    "# Compile the model\n",
- "model.compile('adam', 'binary_crossentropy',metrics=['binary_crossentropy', tf.keras.metrics.AUC()])"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 31,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:28:43.885773Z",
- "start_time": "2020-11-18T04:26:48.746787Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:26.195838Z",
+ "start_time": "2020-11-18T04:21:46.115002Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[1]\tvalid_0's ndcg@1: 0.909975\tvalid_0's ndcg@2: 0.963068\tvalid_0's ndcg@3: 0.96533\tvalid_0's ndcg@4: 0.965729\tvalid_0's ndcg@5: 0.965864\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9143\tvalid_0's ndcg@2: 0.964711\tvalid_0's ndcg@3: 0.966961\tvalid_0's ndcg@4: 0.967338\tvalid_0's ndcg@5: 0.967483\n",
+ "[3]\tvalid_0's ndcg@1: 0.9181\tvalid_0's ndcg@2: 0.966114\tvalid_0's ndcg@3: 0.968289\tvalid_0's ndcg@4: 0.968773\tvalid_0's ndcg@5: 0.96887\n",
+ "[4]\tvalid_0's ndcg@1: 0.925575\tvalid_0's ndcg@2: 0.969093\tvalid_0's ndcg@3: 0.971193\tvalid_0's ndcg@4: 0.971603\tvalid_0's ndcg@5: 0.97169\n",
+ "[5]\tvalid_0's ndcg@1: 0.9267\tvalid_0's ndcg@2: 0.969635\tvalid_0's ndcg@3: 0.97166\tvalid_0's ndcg@4: 0.972037\tvalid_0's ndcg@5: 0.972133\n",
+ "[6]\tvalid_0's ndcg@1: 0.927\tvalid_0's ndcg@2: 0.969682\tvalid_0's ndcg@3: 0.971757\tvalid_0's ndcg@4: 0.972134\tvalid_0's ndcg@5: 0.972231\n",
+ "[7]\tvalid_0's ndcg@1: 0.928825\tvalid_0's ndcg@2: 0.970451\tvalid_0's ndcg@3: 0.972476\tvalid_0's ndcg@4: 0.97282\tvalid_0's ndcg@5: 0.972927\n",
+ "[8]\tvalid_0's ndcg@1: 0.930025\tvalid_0's ndcg@2: 0.970988\tvalid_0's ndcg@3: 0.972951\tvalid_0's ndcg@4: 0.973295\tvalid_0's ndcg@5: 0.973402\n",
+ "[9]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971347\tvalid_0's ndcg@3: 0.973384\tvalid_0's ndcg@4: 0.973707\tvalid_0's ndcg@5: 0.973794\n",
+ "[10]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.971385\tvalid_0's ndcg@3: 0.973372\tvalid_0's ndcg@4: 0.973717\tvalid_0's ndcg@5: 0.973794\n",
+ "[11]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.971433\tvalid_0's ndcg@3: 0.973333\tvalid_0's ndcg@4: 0.973699\tvalid_0's ndcg@5: 0.973767\n",
+ "[12]\tvalid_0's ndcg@1: 0.93145\tvalid_0's ndcg@2: 0.971656\tvalid_0's ndcg@3: 0.973493\tvalid_0's ndcg@4: 0.973881\tvalid_0's ndcg@5: 0.973949\n",
+ "[13]\tvalid_0's ndcg@1: 0.932525\tvalid_0's ndcg@2: 0.971927\tvalid_0's ndcg@3: 0.973839\tvalid_0's ndcg@4: 0.974227\tvalid_0's ndcg@5: 0.974304\n",
+ "[14]\tvalid_0's ndcg@1: 0.932575\tvalid_0's ndcg@2: 0.971898\tvalid_0's ndcg@3: 0.973823\tvalid_0's ndcg@4: 0.974243\tvalid_0's ndcg@5: 0.97432\n",
+ "[15]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972239\tvalid_0's ndcg@3: 0.974189\tvalid_0's ndcg@4: 0.974587\tvalid_0's ndcg@5: 0.974665\n",
+ "[16]\tvalid_0's ndcg@1: 0.933475\tvalid_0's ndcg@2: 0.972309\tvalid_0's ndcg@3: 0.974209\tvalid_0's ndcg@4: 0.974596\tvalid_0's ndcg@5: 0.974674\n",
+ "[17]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972369\tvalid_0's ndcg@3: 0.974307\tvalid_0's ndcg@4: 0.974684\tvalid_0's ndcg@5: 0.974761\n",
+ "[18]\tvalid_0's ndcg@1: 0.9339\tvalid_0's ndcg@2: 0.972497\tvalid_0's ndcg@3: 0.974372\tvalid_0's ndcg@4: 0.974749\tvalid_0's ndcg@5: 0.974836\n",
+ "[19]\tvalid_0's ndcg@1: 0.9345\tvalid_0's ndcg@2: 0.972845\tvalid_0's ndcg@3: 0.974645\tvalid_0's ndcg@4: 0.974979\tvalid_0's ndcg@5: 0.975085\n",
+ "[20]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.97484\tvalid_0's ndcg@4: 0.975174\tvalid_0's ndcg@5: 0.975271\n",
+ "[21]\tvalid_0's ndcg@1: 0.935\tvalid_0's ndcg@2: 0.973092\tvalid_0's ndcg@3: 0.97488\tvalid_0's ndcg@4: 0.975192\tvalid_0's ndcg@5: 0.975289\n",
+ "[22]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.974988\tvalid_0's ndcg@4: 0.975289\tvalid_0's ndcg@5: 0.975386\n",
+ "[23]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974824\tvalid_0's ndcg@4: 0.975136\tvalid_0's ndcg@5: 0.975223\n",
+ "[24]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973274\tvalid_0's ndcg@3: 0.975087\tvalid_0's ndcg@4: 0.975388\tvalid_0's ndcg@5: 0.975475\n",
+ "[25]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973345\tvalid_0's ndcg@3: 0.97512\tvalid_0's ndcg@4: 0.975443\tvalid_0's ndcg@5: 0.97553\n",
+ "[26]\tvalid_0's ndcg@1: 0.93525\tvalid_0's ndcg@2: 0.9732\tvalid_0's ndcg@3: 0.975\tvalid_0's ndcg@4: 0.975313\tvalid_0's ndcg@5: 0.9754\n",
+ "[27]\tvalid_0's ndcg@1: 0.935175\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.974983\tvalid_0's ndcg@4: 0.975295\tvalid_0's ndcg@5: 0.975382\n",
+ "[28]\tvalid_0's ndcg@1: 0.935425\tvalid_0's ndcg@2: 0.973328\tvalid_0's ndcg@3: 0.975041\tvalid_0's ndcg@4: 0.975374\tvalid_0's ndcg@5: 0.975471\n",
+ "[29]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973225\tvalid_0's ndcg@3: 0.974963\tvalid_0's ndcg@4: 0.975297\tvalid_0's ndcg@5: 0.975403\n",
+ "[30]\tvalid_0's ndcg@1: 0.9353\tvalid_0's ndcg@2: 0.973235\tvalid_0's ndcg@3: 0.97501\tvalid_0's ndcg@4: 0.975311\tvalid_0's ndcg@5: 0.975418\n",
+ "[31]\tvalid_0's ndcg@1: 0.9356\tvalid_0's ndcg@2: 0.973361\tvalid_0's ndcg@3: 0.975099\tvalid_0's ndcg@4: 0.975422\tvalid_0's ndcg@5: 0.975528\n",
+ "[32]\tvalid_0's ndcg@1: 0.9364\tvalid_0's ndcg@2: 0.973641\tvalid_0's ndcg@3: 0.975391\tvalid_0's ndcg@4: 0.975714\tvalid_0's ndcg@5: 0.97582\n",
+ "[33]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973751\tvalid_0's ndcg@3: 0.975501\tvalid_0's ndcg@4: 0.975824\tvalid_0's ndcg@5: 0.975931\n",
+ "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.973902\tvalid_0's ndcg@3: 0.975677\tvalid_0's ndcg@4: 0.975989\tvalid_0's ndcg@5: 0.976095\n",
+ "[35]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974105\tvalid_0's ndcg@3: 0.975892\tvalid_0's ndcg@4: 0.976194\tvalid_0's ndcg@5: 0.9763\n",
+ "[36]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974184\tvalid_0's ndcg@3: 0.975984\tvalid_0's ndcg@4: 0.976296\tvalid_0's ndcg@5: 0.976402\n",
+ "[37]\tvalid_0's ndcg@1: 0.93845\tvalid_0's ndcg@2: 0.974366\tvalid_0's ndcg@3: 0.976166\tvalid_0's ndcg@4: 0.976467\tvalid_0's ndcg@5: 0.976574\n",
+ "[38]\tvalid_0's ndcg@1: 0.938925\tvalid_0's ndcg@2: 0.974557\tvalid_0's ndcg@3: 0.976332\tvalid_0's ndcg@4: 0.976655\tvalid_0's ndcg@5: 0.976751\n",
+ "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974471\tvalid_0's ndcg@3: 0.976234\tvalid_0's ndcg@4: 0.976557\tvalid_0's ndcg@5: 0.976653\n",
+ "[40]\tvalid_0's ndcg@1: 0.938325\tvalid_0's ndcg@2: 0.974335\tvalid_0's ndcg@3: 0.97611\tvalid_0's ndcg@4: 0.976433\tvalid_0's ndcg@5: 0.97653\n",
+ "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.974669\tvalid_0's ndcg@3: 0.976431\tvalid_0's ndcg@4: 0.976743\tvalid_0's ndcg@5: 0.97683\n",
+ "[42]\tvalid_0's ndcg@1: 0.939375\tvalid_0's ndcg@2: 0.974833\tvalid_0's ndcg@3: 0.976546\tvalid_0's ndcg@4: 0.976858\tvalid_0's ndcg@5: 0.976945\n",
+ "[43]\tvalid_0's ndcg@1: 0.939625\tvalid_0's ndcg@2: 0.974878\tvalid_0's ndcg@3: 0.976628\tvalid_0's ndcg@4: 0.97694\tvalid_0's ndcg@5: 0.977027\n",
+ "[44]\tvalid_0's ndcg@1: 0.9395\tvalid_0's ndcg@2: 0.974832\tvalid_0's ndcg@3: 0.97657\tvalid_0's ndcg@4: 0.976893\tvalid_0's ndcg@5: 0.97698\n",
+ "[45]\tvalid_0's ndcg@1: 0.939775\tvalid_0's ndcg@2: 0.974949\tvalid_0's ndcg@3: 0.976674\tvalid_0's ndcg@4: 0.976997\tvalid_0's ndcg@5: 0.977084\n",
+ "[46]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.974945\tvalid_0's ndcg@3: 0.976708\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977107\n",
+ "[47]\tvalid_0's ndcg@1: 0.94005\tvalid_0's ndcg@2: 0.975004\tvalid_0's ndcg@3: 0.976766\tvalid_0's ndcg@4: 0.977078\tvalid_0's ndcg@5: 0.977175\n",
+ "[48]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
+ "[49]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975189\tvalid_0's ndcg@3: 0.976939\tvalid_0's ndcg@4: 0.97723\tvalid_0's ndcg@5: 0.977327\n",
+ "[50]\tvalid_0's ndcg@1: 0.9405\tvalid_0's ndcg@2: 0.975264\tvalid_0's ndcg@3: 0.976989\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
+ "[51]\tvalid_0's ndcg@1: 0.941125\tvalid_0's ndcg@2: 0.975526\tvalid_0's ndcg@3: 0.977226\tvalid_0's ndcg@4: 0.977528\tvalid_0's ndcg@5: 0.977605\n",
+ "[52]\tvalid_0's ndcg@1: 0.941\tvalid_0's ndcg@2: 0.97548\tvalid_0's ndcg@3: 0.977193\tvalid_0's ndcg@4: 0.977484\tvalid_0's ndcg@5: 0.977561\n",
+ "[53]\tvalid_0's ndcg@1: 0.9411\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.977259\tvalid_0's ndcg@4: 0.977539\tvalid_0's ndcg@5: 0.977616\n",
+ "[54]\tvalid_0's ndcg@1: 0.9412\tvalid_0's ndcg@2: 0.975712\tvalid_0's ndcg@3: 0.977299\tvalid_0's ndcg@4: 0.97759\tvalid_0's ndcg@5: 0.977667\n",
+ "[55]\tvalid_0's ndcg@1: 0.94155\tvalid_0's ndcg@2: 0.975841\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977719\tvalid_0's ndcg@5: 0.977797\n",
+ "[56]\tvalid_0's ndcg@1: 0.941825\tvalid_0's ndcg@2: 0.975943\tvalid_0's ndcg@3: 0.97753\tvalid_0's ndcg@4: 0.977821\tvalid_0's ndcg@5: 0.977898\n",
+ "[57]\tvalid_0's ndcg@1: 0.9416\tvalid_0's ndcg@2: 0.975891\tvalid_0's ndcg@3: 0.977429\tvalid_0's ndcg@4: 0.977741\tvalid_0's ndcg@5: 0.977818\n",
+ "[58]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977494\tvalid_0's ndcg@4: 0.977795\tvalid_0's ndcg@5: 0.977873\n",
+ "[59]\tvalid_0's ndcg@1: 0.942025\tvalid_0's ndcg@2: 0.975985\tvalid_0's ndcg@3: 0.977547\tvalid_0's ndcg@4: 0.977881\tvalid_0's ndcg@5: 0.977958\n",
+ "[60]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975994\tvalid_0's ndcg@3: 0.977569\tvalid_0's ndcg@4: 0.977892\tvalid_0's ndcg@5: 0.977969\n",
+ "[61]\tvalid_0's ndcg@1: 0.94205\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977559\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.97796\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[62]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976027\tvalid_0's ndcg@3: 0.97764\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.978028\n",
+ "[63]\tvalid_0's ndcg@1: 0.942125\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977622\tvalid_0's ndcg@4: 0.977912\tvalid_0's ndcg@5: 0.977999\n",
+ "[64]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977793\tvalid_0's ndcg@4: 0.978105\tvalid_0's ndcg@5: 0.978192\n",
+ "[65]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976227\tvalid_0's ndcg@3: 0.977802\tvalid_0's ndcg@4: 0.978125\tvalid_0's ndcg@5: 0.978212\n",
+ "[66]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976132\tvalid_0's ndcg@3: 0.977695\tvalid_0's ndcg@4: 0.978018\tvalid_0's ndcg@5: 0.978105\n",
+ "[67]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976092\tvalid_0's ndcg@3: 0.977679\tvalid_0's ndcg@4: 0.978002\tvalid_0's ndcg@5: 0.978089\n",
+ "[68]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976148\tvalid_0's ndcg@3: 0.977698\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.978108\n",
+ "[69]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976123\tvalid_0's ndcg@3: 0.977686\tvalid_0's ndcg@4: 0.978009\tvalid_0's ndcg@5: 0.978096\n",
+ "[70]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976222\tvalid_0's ndcg@3: 0.977785\tvalid_0's ndcg@4: 0.978097\tvalid_0's ndcg@5: 0.978184\n",
+ "[71]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976188\tvalid_0's ndcg@3: 0.977763\tvalid_0's ndcg@4: 0.978075\tvalid_0's ndcg@5: 0.978162\n",
+ "[72]\tvalid_0's ndcg@1: 0.9427\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977809\tvalid_0's ndcg@4: 0.978121\tvalid_0's ndcg@5: 0.978208\n",
+ "[73]\tvalid_0's ndcg@1: 0.9428\tvalid_0's ndcg@2: 0.976255\tvalid_0's ndcg@3: 0.977843\tvalid_0's ndcg@4: 0.978155\tvalid_0's ndcg@5: 0.978242\n",
+ "[74]\tvalid_0's ndcg@1: 0.94295\tvalid_0's ndcg@2: 0.97631\tvalid_0's ndcg@3: 0.977898\tvalid_0's ndcg@4: 0.97821\tvalid_0's ndcg@5: 0.978297\n",
+ "[75]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976329\tvalid_0's ndcg@3: 0.977941\tvalid_0's ndcg@4: 0.978232\tvalid_0's ndcg@5: 0.978319\n",
+ "[76]\tvalid_0's ndcg@1: 0.9433\tvalid_0's ndcg@2: 0.976471\tvalid_0's ndcg@3: 0.978059\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978437\n",
+ "[77]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976416\tvalid_0's ndcg@3: 0.977991\tvalid_0's ndcg@4: 0.978314\tvalid_0's ndcg@5: 0.978381\n",
+ "[78]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976657\tvalid_0's ndcg@3: 0.978194\tvalid_0's ndcg@4: 0.978517\tvalid_0's ndcg@5: 0.978585\n",
+ "[79]\tvalid_0's ndcg@1: 0.94365\tvalid_0's ndcg@2: 0.976663\tvalid_0's ndcg@3: 0.978188\tvalid_0's ndcg@4: 0.978501\tvalid_0's ndcg@5: 0.978578\n",
+ "[80]\tvalid_0's ndcg@1: 0.943725\tvalid_0's ndcg@2: 0.976628\tvalid_0's ndcg@3: 0.978203\tvalid_0's ndcg@4: 0.978515\tvalid_0's ndcg@5: 0.978593\n",
+ "[81]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97672\tvalid_0's ndcg@3: 0.978295\tvalid_0's ndcg@4: 0.978607\tvalid_0's ndcg@5: 0.978685\n",
+ "[82]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978397\tvalid_0's ndcg@4: 0.97872\tvalid_0's ndcg@5: 0.978787\n",
+ "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976788\tvalid_0's ndcg@3: 0.978375\tvalid_0's ndcg@4: 0.978698\tvalid_0's ndcg@5: 0.978766\n",
+ "[84]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.97679\tvalid_0's ndcg@3: 0.97839\tvalid_0's ndcg@4: 0.978702\tvalid_0's ndcg@5: 0.97878\n",
+ "[85]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.976809\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978723\tvalid_0's ndcg@5: 0.9788\n",
+ "[86]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976939\tvalid_0's ndcg@3: 0.978502\tvalid_0's ndcg@4: 0.978814\tvalid_0's ndcg@5: 0.978891\n",
+ "[87]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.976976\tvalid_0's ndcg@3: 0.978551\tvalid_0's ndcg@4: 0.978852\tvalid_0's ndcg@5: 0.97893\n",
+ "[88]\tvalid_0's ndcg@1: 0.944925\tvalid_0's ndcg@2: 0.977102\tvalid_0's ndcg@3: 0.978677\tvalid_0's ndcg@4: 0.978968\tvalid_0's ndcg@5: 0.979045\n",
+ "[89]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978758\tvalid_0's ndcg@4: 0.979048\tvalid_0's ndcg@5: 0.979126\n",
+ "[90]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.978735\tvalid_0's ndcg@4: 0.979026\tvalid_0's ndcg@5: 0.979104\n",
+ "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977208\tvalid_0's ndcg@3: 0.978858\tvalid_0's ndcg@4: 0.979138\tvalid_0's ndcg@5: 0.979215\n",
+ "[92]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.977267\tvalid_0's ndcg@3: 0.978905\tvalid_0's ndcg@4: 0.979174\tvalid_0's ndcg@5: 0.979251\n",
+ "[93]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977193\tvalid_0's ndcg@3: 0.978818\tvalid_0's ndcg@4: 0.979098\tvalid_0's ndcg@5: 0.979176\n",
+ "[94]\tvalid_0's ndcg@1: 0.94545\tvalid_0's ndcg@2: 0.97728\tvalid_0's ndcg@3: 0.97888\tvalid_0's ndcg@4: 0.97916\tvalid_0's ndcg@5: 0.979238\n",
+ "[95]\tvalid_0's ndcg@1: 0.9458\tvalid_0's ndcg@2: 0.977394\tvalid_0's ndcg@3: 0.979006\tvalid_0's ndcg@4: 0.979286\tvalid_0's ndcg@5: 0.979364\n",
+ "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979114\tvalid_0's ndcg@4: 0.979394\tvalid_0's ndcg@5: 0.979472\n",
+ "[97]\tvalid_0's ndcg@1: 0.946475\tvalid_0's ndcg@2: 0.977659\tvalid_0's ndcg@3: 0.979259\tvalid_0's ndcg@4: 0.979539\tvalid_0's ndcg@5: 0.979616\n",
+ "[98]\tvalid_0's ndcg@1: 0.94675\tvalid_0's ndcg@2: 0.97776\tvalid_0's ndcg@3: 0.97936\tvalid_0's ndcg@4: 0.979651\tvalid_0's ndcg@5: 0.979719\n",
+ "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
+ "[100]\tvalid_0's ndcg@1: 0.9468\tvalid_0's ndcg@2: 0.977794\tvalid_0's ndcg@3: 0.979369\tvalid_0's ndcg@4: 0.979671\tvalid_0's ndcg@5: 0.979739\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[99]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.977831\tvalid_0's ndcg@3: 0.979419\tvalid_0's ndcg@4: 0.97971\tvalid_0's ndcg@5: 0.979777\n",
+ "[1]\tvalid_0's ndcg@1: 0.909075\tvalid_0's ndcg@2: 0.963019\tvalid_0's ndcg@3: 0.965069\tvalid_0's ndcg@4: 0.965543\tvalid_0's ndcg@5: 0.965601\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9123\tvalid_0's ndcg@2: 0.964273\tvalid_0's ndcg@3: 0.966248\tvalid_0's ndcg@4: 0.966722\tvalid_0's ndcg@5: 0.966789\n",
+ "[3]\tvalid_0's ndcg@1: 0.915075\tvalid_0's ndcg@2: 0.965691\tvalid_0's ndcg@3: 0.967466\tvalid_0's ndcg@4: 0.967854\tvalid_0's ndcg@5: 0.967922\n",
+ "[4]\tvalid_0's ndcg@1: 0.91845\tvalid_0's ndcg@2: 0.967047\tvalid_0's ndcg@3: 0.968735\tvalid_0's ndcg@4: 0.969133\tvalid_0's ndcg@5: 0.969201\n",
+ "[5]\tvalid_0's ndcg@1: 0.92355\tvalid_0's ndcg@2: 0.968961\tvalid_0's ndcg@3: 0.970674\tvalid_0's ndcg@4: 0.97104\tvalid_0's ndcg@5: 0.971098\n",
+ "[6]\tvalid_0's ndcg@1: 0.9253\tvalid_0's ndcg@2: 0.969607\tvalid_0's ndcg@3: 0.971345\tvalid_0's ndcg@4: 0.971689\tvalid_0's ndcg@5: 0.971747\n",
+ "[7]\tvalid_0's ndcg@1: 0.926225\tvalid_0's ndcg@2: 0.969933\tvalid_0's ndcg@3: 0.971708\tvalid_0's ndcg@4: 0.972031\tvalid_0's ndcg@5: 0.972079\n",
+ "[8]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.970104\tvalid_0's ndcg@3: 0.971804\tvalid_0's ndcg@4: 0.972116\tvalid_0's ndcg@5: 0.972184\n",
+ "[9]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970682\tvalid_0's ndcg@3: 0.972307\tvalid_0's ndcg@4: 0.972598\tvalid_0's ndcg@5: 0.972675\n",
+ "[10]\tvalid_0's ndcg@1: 0.92775\tvalid_0's ndcg@2: 0.970653\tvalid_0's ndcg@3: 0.972316\tvalid_0's ndcg@4: 0.972617\tvalid_0's ndcg@5: 0.972685\n",
+ "[11]\tvalid_0's ndcg@1: 0.9283\tvalid_0's ndcg@2: 0.97084\tvalid_0's ndcg@3: 0.97254\tvalid_0's ndcg@4: 0.97281\tvalid_0's ndcg@5: 0.972887\n",
+ "[12]\tvalid_0's ndcg@1: 0.9287\tvalid_0's ndcg@2: 0.971051\tvalid_0's ndcg@3: 0.972701\tvalid_0's ndcg@4: 0.97297\tvalid_0's ndcg@5: 0.973048\n",
+ "[13]\tvalid_0's ndcg@1: 0.9297\tvalid_0's ndcg@2: 0.971389\tvalid_0's ndcg@3: 0.973001\tvalid_0's ndcg@4: 0.973313\tvalid_0's ndcg@5: 0.9734\n",
+ "[14]\tvalid_0's ndcg@1: 0.92955\tvalid_0's ndcg@2: 0.971444\tvalid_0's ndcg@3: 0.972994\tvalid_0's ndcg@4: 0.973284\tvalid_0's ndcg@5: 0.973371\n",
+ "[15]\tvalid_0's ndcg@1: 0.930225\tvalid_0's ndcg@2: 0.97174\tvalid_0's ndcg@3: 0.973253\tvalid_0's ndcg@4: 0.973543\tvalid_0's ndcg@5: 0.97363\n",
+ "[16]\tvalid_0's ndcg@1: 0.930425\tvalid_0's ndcg@2: 0.971798\tvalid_0's ndcg@3: 0.973298\tvalid_0's ndcg@4: 0.97361\tvalid_0's ndcg@5: 0.973698\n",
+ "[17]\tvalid_0's ndcg@1: 0.93125\tvalid_0's ndcg@2: 0.971992\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973903\tvalid_0's ndcg@5: 0.97398\n",
+ "[18]\tvalid_0's ndcg@1: 0.931925\tvalid_0's ndcg@2: 0.972257\tvalid_0's ndcg@3: 0.973845\tvalid_0's ndcg@4: 0.974146\tvalid_0's ndcg@5: 0.974224\n",
+ "[19]\tvalid_0's ndcg@1: 0.932375\tvalid_0's ndcg@2: 0.972376\tvalid_0's ndcg@3: 0.974038\tvalid_0's ndcg@4: 0.974318\tvalid_0's ndcg@5: 0.974376\n",
+ "[20]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.972269\tvalid_0's ndcg@3: 0.973907\tvalid_0's ndcg@4: 0.974187\tvalid_0's ndcg@5: 0.974245\n",
+ "[21]\tvalid_0's ndcg@1: 0.932725\tvalid_0's ndcg@2: 0.972568\tvalid_0's ndcg@3: 0.974181\tvalid_0's ndcg@4: 0.974471\tvalid_0's ndcg@5: 0.974529\n",
+ "[22]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972735\tvalid_0's ndcg@3: 0.974298\tvalid_0's ndcg@4: 0.974599\tvalid_0's ndcg@5: 0.974657\n",
+ "[23]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972642\tvalid_0's ndcg@3: 0.974255\tvalid_0's ndcg@4: 0.974545\tvalid_0's ndcg@5: 0.974594\n",
+ "[24]\tvalid_0's ndcg@1: 0.933175\tvalid_0's ndcg@2: 0.972734\tvalid_0's ndcg@3: 0.974347\tvalid_0's ndcg@4: 0.974638\tvalid_0's ndcg@5: 0.974686\n",
+ "[25]\tvalid_0's ndcg@1: 0.9331\tvalid_0's ndcg@2: 0.972754\tvalid_0's ndcg@3: 0.974366\tvalid_0's ndcg@4: 0.974636\tvalid_0's ndcg@5: 0.974674\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[26]\tvalid_0's ndcg@1: 0.933275\tvalid_0's ndcg@2: 0.972787\tvalid_0's ndcg@3: 0.974424\tvalid_0's ndcg@4: 0.974694\tvalid_0's ndcg@5: 0.974732\n",
+ "[27]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972809\tvalid_0's ndcg@3: 0.974434\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.974732\n",
+ "[28]\tvalid_0's ndcg@1: 0.933625\tvalid_0's ndcg@2: 0.972932\tvalid_0's ndcg@3: 0.974557\tvalid_0's ndcg@4: 0.974826\tvalid_0's ndcg@5: 0.974855\n",
+ "[29]\tvalid_0's ndcg@1: 0.933725\tvalid_0's ndcg@2: 0.972937\tvalid_0's ndcg@3: 0.974587\tvalid_0's ndcg@4: 0.974856\tvalid_0's ndcg@5: 0.974885\n",
+ "[30]\tvalid_0's ndcg@1: 0.93355\tvalid_0's ndcg@2: 0.972873\tvalid_0's ndcg@3: 0.974523\tvalid_0's ndcg@4: 0.974792\tvalid_0's ndcg@5: 0.974821\n",
+ "[31]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973065\tvalid_0's ndcg@3: 0.974753\tvalid_0's ndcg@4: 0.975022\tvalid_0's ndcg@5: 0.975051\n",
+ "[32]\tvalid_0's ndcg@1: 0.93435\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974815\tvalid_0's ndcg@4: 0.975084\tvalid_0's ndcg@5: 0.975113\n",
+ "[33]\tvalid_0's ndcg@1: 0.934475\tvalid_0's ndcg@2: 0.97323\tvalid_0's ndcg@3: 0.974855\tvalid_0's ndcg@4: 0.975135\tvalid_0's ndcg@5: 0.975164\n",
+ "[34]\tvalid_0's ndcg@1: 0.9342\tvalid_0's ndcg@2: 0.973113\tvalid_0's ndcg@3: 0.974738\tvalid_0's ndcg@4: 0.975028\tvalid_0's ndcg@5: 0.975057\n",
+ "[35]\tvalid_0's ndcg@1: 0.93455\tvalid_0's ndcg@2: 0.973258\tvalid_0's ndcg@3: 0.97487\tvalid_0's ndcg@4: 0.975172\tvalid_0's ndcg@5: 0.975201\n",
+ "[36]\tvalid_0's ndcg@1: 0.9344\tvalid_0's ndcg@2: 0.973265\tvalid_0's ndcg@3: 0.974828\tvalid_0's ndcg@4: 0.975129\tvalid_0's ndcg@5: 0.975158\n",
+ "[37]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973438\tvalid_0's ndcg@3: 0.975013\tvalid_0's ndcg@4: 0.975304\tvalid_0's ndcg@5: 0.975323\n",
+ "[38]\tvalid_0's ndcg@1: 0.934975\tvalid_0's ndcg@2: 0.973541\tvalid_0's ndcg@3: 0.975066\tvalid_0's ndcg@4: 0.975367\tvalid_0's ndcg@5: 0.975386\n",
+ "[39]\tvalid_0's ndcg@1: 0.935275\tvalid_0's ndcg@2: 0.973667\tvalid_0's ndcg@3: 0.975192\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975502\n",
+ "[40]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973624\tvalid_0's ndcg@3: 0.975174\tvalid_0's ndcg@4: 0.975454\tvalid_0's ndcg@5: 0.975473\n",
+ "[41]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973686\tvalid_0's ndcg@3: 0.975223\tvalid_0's ndcg@4: 0.975503\tvalid_0's ndcg@5: 0.975522\n",
+ "[42]\tvalid_0's ndcg@1: 0.93545\tvalid_0's ndcg@2: 0.973716\tvalid_0's ndcg@3: 0.975266\tvalid_0's ndcg@4: 0.975546\tvalid_0's ndcg@5: 0.975565\n",
+ "[43]\tvalid_0's ndcg@1: 0.93615\tvalid_0's ndcg@2: 0.974022\tvalid_0's ndcg@3: 0.975534\tvalid_0's ndcg@4: 0.975814\tvalid_0's ndcg@5: 0.975843\n",
+ "[44]\tvalid_0's ndcg@1: 0.936225\tvalid_0's ndcg@2: 0.974112\tvalid_0's ndcg@3: 0.975562\tvalid_0's ndcg@4: 0.975853\tvalid_0's ndcg@5: 0.975882\n",
+ "[45]\tvalid_0's ndcg@1: 0.9365\tvalid_0's ndcg@2: 0.974167\tvalid_0's ndcg@3: 0.975654\tvalid_0's ndcg@4: 0.975945\tvalid_0's ndcg@5: 0.975974\n",
+ "[46]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974206\tvalid_0's ndcg@3: 0.975694\tvalid_0's ndcg@4: 0.975995\tvalid_0's ndcg@5: 0.976024\n",
+ "[47]\tvalid_0's ndcg@1: 0.93685\tvalid_0's ndcg@2: 0.974311\tvalid_0's ndcg@3: 0.975786\tvalid_0's ndcg@4: 0.976077\tvalid_0's ndcg@5: 0.976106\n",
+ "[48]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974408\tvalid_0's ndcg@3: 0.975845\tvalid_0's ndcg@4: 0.976147\tvalid_0's ndcg@5: 0.976185\n",
+ "[49]\tvalid_0's ndcg@1: 0.936975\tvalid_0's ndcg@2: 0.974342\tvalid_0's ndcg@3: 0.975829\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.976159\n",
+ "[50]\tvalid_0's ndcg@1: 0.9371\tvalid_0's ndcg@2: 0.974388\tvalid_0's ndcg@3: 0.97585\tvalid_0's ndcg@4: 0.976152\tvalid_0's ndcg@5: 0.976191\n",
+ "[51]\tvalid_0's ndcg@1: 0.937025\tvalid_0's ndcg@2: 0.974329\tvalid_0's ndcg@3: 0.975841\tvalid_0's ndcg@4: 0.976121\tvalid_0's ndcg@5: 0.97616\n",
+ "[52]\tvalid_0's ndcg@1: 0.9377\tvalid_0's ndcg@2: 0.974578\tvalid_0's ndcg@3: 0.976078\tvalid_0's ndcg@4: 0.976369\tvalid_0's ndcg@5: 0.976407\n",
+ "[53]\tvalid_0's ndcg@1: 0.9378\tvalid_0's ndcg@2: 0.974615\tvalid_0's ndcg@3: 0.976115\tvalid_0's ndcg@4: 0.976405\tvalid_0's ndcg@5: 0.976444\n",
+ "[54]\tvalid_0's ndcg@1: 0.938\tvalid_0's ndcg@2: 0.974689\tvalid_0's ndcg@3: 0.976214\tvalid_0's ndcg@4: 0.976483\tvalid_0's ndcg@5: 0.976521\n",
+ "[55]\tvalid_0's ndcg@1: 0.938225\tvalid_0's ndcg@2: 0.974803\tvalid_0's ndcg@3: 0.976303\tvalid_0's ndcg@4: 0.976572\tvalid_0's ndcg@5: 0.976611\n",
+ "[56]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.9748\tvalid_0's ndcg@3: 0.976275\tvalid_0's ndcg@4: 0.976555\tvalid_0's ndcg@5: 0.976594\n",
+ "[57]\tvalid_0's ndcg@1: 0.938525\tvalid_0's ndcg@2: 0.974914\tvalid_0's ndcg@3: 0.976414\tvalid_0's ndcg@4: 0.976683\tvalid_0's ndcg@5: 0.976722\n",
+ "[58]\tvalid_0's ndcg@1: 0.93875\tvalid_0's ndcg@2: 0.975028\tvalid_0's ndcg@3: 0.976503\tvalid_0's ndcg@4: 0.976773\tvalid_0's ndcg@5: 0.976811\n",
+ "[59]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975198\tvalid_0's ndcg@3: 0.976648\tvalid_0's ndcg@4: 0.976918\tvalid_0's ndcg@5: 0.976956\n",
+ "[60]\tvalid_0's ndcg@1: 0.939025\tvalid_0's ndcg@2: 0.975177\tvalid_0's ndcg@3: 0.976615\tvalid_0's ndcg@4: 0.976884\tvalid_0's ndcg@5: 0.976923\n",
+ "[61]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975205\tvalid_0's ndcg@3: 0.976642\tvalid_0's ndcg@4: 0.976912\tvalid_0's ndcg@5: 0.97695\n",
+ "[62]\tvalid_0's ndcg@1: 0.93965\tvalid_0's ndcg@2: 0.975424\tvalid_0's ndcg@3: 0.976836\tvalid_0's ndcg@4: 0.977116\tvalid_0's ndcg@5: 0.977155\n",
+ "[63]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.975596\tvalid_0's ndcg@3: 0.976996\tvalid_0's ndcg@4: 0.977276\tvalid_0's ndcg@5: 0.977315\n",
+ "[64]\tvalid_0's ndcg@1: 0.940375\tvalid_0's ndcg@2: 0.975723\tvalid_0's ndcg@3: 0.977123\tvalid_0's ndcg@4: 0.977392\tvalid_0's ndcg@5: 0.977431\n",
+ "[65]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977154\tvalid_0's ndcg@4: 0.977423\tvalid_0's ndcg@5: 0.977462\n",
+ "[66]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975744\tvalid_0's ndcg@3: 0.977156\tvalid_0's ndcg@4: 0.977426\tvalid_0's ndcg@5: 0.977464\n",
+ "[67]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.97576\tvalid_0's ndcg@3: 0.977172\tvalid_0's ndcg@4: 0.977431\tvalid_0's ndcg@5: 0.977469\n",
+ "[68]\tvalid_0's ndcg@1: 0.940675\tvalid_0's ndcg@2: 0.975849\tvalid_0's ndcg@3: 0.977249\tvalid_0's ndcg@4: 0.977508\tvalid_0's ndcg@5: 0.977546\n",
+ "[69]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.976017\tvalid_0's ndcg@3: 0.977454\tvalid_0's ndcg@4: 0.977724\tvalid_0's ndcg@5: 0.977762\n",
+ "[70]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.977362\tvalid_0's ndcg@4: 0.977631\tvalid_0's ndcg@5: 0.97767\n",
+ "[71]\tvalid_0's ndcg@1: 0.94105\tvalid_0's ndcg@2: 0.975925\tvalid_0's ndcg@3: 0.97735\tvalid_0's ndcg@4: 0.97763\tvalid_0's ndcg@5: 0.977668\n",
+ "[72]\tvalid_0's ndcg@1: 0.941325\tvalid_0's ndcg@2: 0.976058\tvalid_0's ndcg@3: 0.97747\tvalid_0's ndcg@4: 0.977739\tvalid_0's ndcg@5: 0.977778\n",
+ "[73]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977476\tvalid_0's ndcg@4: 0.977756\tvalid_0's ndcg@5: 0.977795\n",
+ "[74]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.97759\tvalid_0's ndcg@4: 0.97788\tvalid_0's ndcg@5: 0.977919\n",
+ "[75]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.97619\tvalid_0's ndcg@3: 0.977602\tvalid_0's ndcg@4: 0.977882\tvalid_0's ndcg@5: 0.977921\n",
+ "[76]\tvalid_0's ndcg@1: 0.94195\tvalid_0's ndcg@2: 0.976273\tvalid_0's ndcg@3: 0.977685\tvalid_0's ndcg@4: 0.977965\tvalid_0's ndcg@5: 0.978004\n",
+ "[77]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.97627\tvalid_0's ndcg@3: 0.97767\tvalid_0's ndcg@4: 0.97795\tvalid_0's ndcg@5: 0.977989\n",
+ "[78]\tvalid_0's ndcg@1: 0.94235\tvalid_0's ndcg@2: 0.976452\tvalid_0's ndcg@3: 0.977839\tvalid_0's ndcg@4: 0.978119\tvalid_0's ndcg@5: 0.978158\n",
+ "[79]\tvalid_0's ndcg@1: 0.94265\tvalid_0's ndcg@2: 0.976562\tvalid_0's ndcg@3: 0.977937\tvalid_0's ndcg@4: 0.978228\tvalid_0's ndcg@5: 0.978267\n",
+ "[80]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976667\tvalid_0's ndcg@3: 0.978067\tvalid_0's ndcg@4: 0.978347\tvalid_0's ndcg@5: 0.978385\n",
+ "[81]\tvalid_0's ndcg@1: 0.94305\tvalid_0's ndcg@2: 0.97671\tvalid_0's ndcg@3: 0.978098\tvalid_0's ndcg@4: 0.978378\tvalid_0's ndcg@5: 0.978416\n",
+ "[82]\tvalid_0's ndcg@1: 0.943175\tvalid_0's ndcg@2: 0.97674\tvalid_0's ndcg@3: 0.978115\tvalid_0's ndcg@4: 0.978417\tvalid_0's ndcg@5: 0.978456\n",
+ "[83]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976752\tvalid_0's ndcg@3: 0.97814\tvalid_0's ndcg@4: 0.978441\tvalid_0's ndcg@5: 0.97848\n",
+ "[84]\tvalid_0's ndcg@1: 0.943375\tvalid_0's ndcg@2: 0.976767\tvalid_0's ndcg@3: 0.978179\tvalid_0's ndcg@4: 0.978481\tvalid_0's ndcg@5: 0.97852\n",
+ "[85]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976721\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978437\tvalid_0's ndcg@5: 0.978475\n",
+ "[86]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976792\tvalid_0's ndcg@3: 0.978204\tvalid_0's ndcg@4: 0.978506\tvalid_0's ndcg@5: 0.978535\n",
+ "[87]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.976851\tvalid_0's ndcg@3: 0.978239\tvalid_0's ndcg@4: 0.97854\tvalid_0's ndcg@5: 0.978569\n",
+ "[88]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976882\tvalid_0's ndcg@3: 0.978282\tvalid_0's ndcg@4: 0.978572\tvalid_0's ndcg@5: 0.978611\n",
+ "[89]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.976915\tvalid_0's ndcg@3: 0.97834\tvalid_0's ndcg@4: 0.97863\tvalid_0's ndcg@5: 0.978669\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[90]\tvalid_0's ndcg@1: 0.943925\tvalid_0's ndcg@2: 0.976986\tvalid_0's ndcg@3: 0.978398\tvalid_0's ndcg@4: 0.978689\tvalid_0's ndcg@5: 0.978728\n",
+ "[91]\tvalid_0's ndcg@1: 0.943875\tvalid_0's ndcg@2: 0.976999\tvalid_0's ndcg@3: 0.978399\tvalid_0's ndcg@4: 0.978679\tvalid_0's ndcg@5: 0.978717\n",
+ "[92]\tvalid_0's ndcg@1: 0.94395\tvalid_0's ndcg@2: 0.977058\tvalid_0's ndcg@3: 0.978421\tvalid_0's ndcg@4: 0.978711\tvalid_0's ndcg@5: 0.97876\n",
+ "[93]\tvalid_0's ndcg@1: 0.944075\tvalid_0's ndcg@2: 0.977104\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978759\tvalid_0's ndcg@5: 0.978807\n",
+ "[94]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977125\tvalid_0's ndcg@3: 0.978513\tvalid_0's ndcg@4: 0.978793\tvalid_0's ndcg@5: 0.978841\n",
+ "[95]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977153\tvalid_0's ndcg@3: 0.97854\tvalid_0's ndcg@4: 0.97882\tvalid_0's ndcg@5: 0.978869\n",
+ "[96]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977144\tvalid_0's ndcg@3: 0.978531\tvalid_0's ndcg@4: 0.978811\tvalid_0's ndcg@5: 0.97886\n",
+ "[97]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977221\tvalid_0's ndcg@3: 0.978584\tvalid_0's ndcg@4: 0.978864\tvalid_0's ndcg@5: 0.978912\n",
+ "[98]\tvalid_0's ndcg@1: 0.944575\tvalid_0's ndcg@2: 0.977289\tvalid_0's ndcg@3: 0.978651\tvalid_0's ndcg@4: 0.978942\tvalid_0's ndcg@5: 0.97899\n",
+ "[99]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977341\tvalid_0's ndcg@3: 0.978691\tvalid_0's ndcg@4: 0.978993\tvalid_0's ndcg@5: 0.979032\n",
+ "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.9451\tvalid_0's ndcg@2: 0.977482\tvalid_0's ndcg@3: 0.978857\tvalid_0's ndcg@4: 0.979148\tvalid_0's ndcg@5: 0.979187\n",
+ "[1]\tvalid_0's ndcg@1: 0.911575\tvalid_0's ndcg@2: 0.964384\tvalid_0's ndcg@3: 0.966321\tvalid_0's ndcg@4: 0.966623\tvalid_0's ndcg@5: 0.966671\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9136\tvalid_0's ndcg@2: 0.965257\tvalid_0's ndcg@3: 0.967107\tvalid_0's ndcg@4: 0.967398\tvalid_0's ndcg@5: 0.967456\n",
+ "[3]\tvalid_0's ndcg@1: 0.917425\tvalid_0's ndcg@2: 0.966732\tvalid_0's ndcg@3: 0.968545\tvalid_0's ndcg@4: 0.968814\tvalid_0's ndcg@5: 0.968882\n",
+ "[4]\tvalid_0's ndcg@1: 0.9222\tvalid_0's ndcg@2: 0.968558\tvalid_0's ndcg@3: 0.970383\tvalid_0's ndcg@4: 0.970619\tvalid_0's ndcg@5: 0.970668\n",
+ "[5]\tvalid_0's ndcg@1: 0.925875\tvalid_0's ndcg@2: 0.969914\tvalid_0's ndcg@3: 0.971714\tvalid_0's ndcg@4: 0.971972\tvalid_0's ndcg@5: 0.972021\n",
+ "[6]\tvalid_0's ndcg@1: 0.926875\tvalid_0's ndcg@2: 0.970425\tvalid_0's ndcg@3: 0.972112\tvalid_0's ndcg@4: 0.972371\tvalid_0's ndcg@5: 0.972419\n",
+ "[7]\tvalid_0's ndcg@1: 0.927475\tvalid_0's ndcg@2: 0.970631\tvalid_0's ndcg@3: 0.972306\tvalid_0's ndcg@4: 0.972586\tvalid_0's ndcg@5: 0.972634\n",
+ "[8]\tvalid_0's ndcg@1: 0.93015\tvalid_0's ndcg@2: 0.971649\tvalid_0's ndcg@3: 0.973287\tvalid_0's ndcg@4: 0.973567\tvalid_0's ndcg@5: 0.973625\n",
+ "[9]\tvalid_0's ndcg@1: 0.9312\tvalid_0's ndcg@2: 0.972084\tvalid_0's ndcg@3: 0.973684\tvalid_0's ndcg@4: 0.973964\tvalid_0's ndcg@5: 0.974022\n",
+ "[10]\tvalid_0's ndcg@1: 0.93225\tvalid_0's ndcg@2: 0.972456\tvalid_0's ndcg@3: 0.974081\tvalid_0's ndcg@4: 0.974361\tvalid_0's ndcg@5: 0.974409\n",
+ "[11]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.972704\tvalid_0's ndcg@3: 0.974379\tvalid_0's ndcg@4: 0.974648\tvalid_0's ndcg@5: 0.974696\n",
+ "[12]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972949\tvalid_0's ndcg@3: 0.974574\tvalid_0's ndcg@4: 0.974832\tvalid_0's ndcg@5: 0.974881\n",
+ "[13]\tvalid_0's ndcg@1: 0.93415\tvalid_0's ndcg@2: 0.97322\tvalid_0's ndcg@3: 0.97482\tvalid_0's ndcg@4: 0.975079\tvalid_0's ndcg@5: 0.975127\n",
+ "[14]\tvalid_0's ndcg@1: 0.9352\tvalid_0's ndcg@2: 0.973671\tvalid_0's ndcg@3: 0.975246\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975531\n",
+ "[15]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.973877\tvalid_0's ndcg@3: 0.975452\tvalid_0's ndcg@4: 0.975699\tvalid_0's ndcg@5: 0.975748\n",
+ "[16]\tvalid_0's ndcg@1: 0.935825\tvalid_0's ndcg@2: 0.973917\tvalid_0's ndcg@3: 0.975442\tvalid_0's ndcg@4: 0.975712\tvalid_0's ndcg@5: 0.97576\n",
+ "[17]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.97411\tvalid_0's ndcg@3: 0.975697\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975995\n",
+ "[18]\tvalid_0's ndcg@1: 0.936925\tvalid_0's ndcg@2: 0.974292\tvalid_0's ndcg@3: 0.975867\tvalid_0's ndcg@4: 0.976114\tvalid_0's ndcg@5: 0.976163\n",
+ "[19]\tvalid_0's ndcg@1: 0.937525\tvalid_0's ndcg@2: 0.974545\tvalid_0's ndcg@3: 0.976095\tvalid_0's ndcg@4: 0.976342\tvalid_0's ndcg@5: 0.976391\n",
+ "[20]\tvalid_0's ndcg@1: 0.937775\tvalid_0's ndcg@2: 0.974653\tvalid_0's ndcg@3: 0.976203\tvalid_0's ndcg@4: 0.976429\tvalid_0's ndcg@5: 0.976487\n",
+ "[21]\tvalid_0's ndcg@1: 0.938825\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976597\tvalid_0's ndcg@4: 0.976823\tvalid_0's ndcg@5: 0.976881\n",
+ "[22]\tvalid_0's ndcg@1: 0.93885\tvalid_0's ndcg@2: 0.975097\tvalid_0's ndcg@3: 0.976609\tvalid_0's ndcg@4: 0.976846\tvalid_0's ndcg@5: 0.976895\n",
+ "[23]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976733\tvalid_0's ndcg@4: 0.976959\tvalid_0's ndcg@5: 0.977008\n",
+ "[24]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.975246\tvalid_0's ndcg@3: 0.976721\tvalid_0's ndcg@4: 0.976947\tvalid_0's ndcg@5: 0.977005\n",
+ "[25]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975421\tvalid_0's ndcg@3: 0.976909\tvalid_0's ndcg@4: 0.977124\tvalid_0's ndcg@5: 0.977182\n",
+ "[26]\tvalid_0's ndcg@1: 0.9393\tvalid_0's ndcg@2: 0.975342\tvalid_0's ndcg@3: 0.976804\tvalid_0's ndcg@4: 0.97702\tvalid_0's ndcg@5: 0.977078\n",
+ "[27]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975323\tvalid_0's ndcg@3: 0.976798\tvalid_0's ndcg@4: 0.977014\tvalid_0's ndcg@5: 0.977062\n",
+ "[28]\tvalid_0's ndcg@1: 0.93925\tvalid_0's ndcg@2: 0.975308\tvalid_0's ndcg@3: 0.976783\tvalid_0's ndcg@4: 0.977009\tvalid_0's ndcg@5: 0.977057\n",
+ "[29]\tvalid_0's ndcg@1: 0.94\tvalid_0's ndcg@2: 0.975569\tvalid_0's ndcg@3: 0.977056\tvalid_0's ndcg@4: 0.977282\tvalid_0's ndcg@5: 0.977331\n",
+ "[30]\tvalid_0's ndcg@1: 0.940325\tvalid_0's ndcg@2: 0.975673\tvalid_0's ndcg@3: 0.977173\tvalid_0's ndcg@4: 0.977399\tvalid_0's ndcg@5: 0.977447\n",
+ "[31]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975731\tvalid_0's ndcg@3: 0.977243\tvalid_0's ndcg@4: 0.977469\tvalid_0's ndcg@5: 0.977518\n",
+ "[32]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
+ "[33]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975766\tvalid_0's ndcg@3: 0.977241\tvalid_0's ndcg@4: 0.977457\tvalid_0's ndcg@5: 0.977505\n",
+ "[34]\tvalid_0's ndcg@1: 0.940625\tvalid_0's ndcg@2: 0.975831\tvalid_0's ndcg@3: 0.977306\tvalid_0's ndcg@4: 0.977521\tvalid_0's ndcg@5: 0.97757\n",
+ "[35]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975868\tvalid_0's ndcg@3: 0.977343\tvalid_0's ndcg@4: 0.977558\tvalid_0's ndcg@5: 0.977606\n",
+ "[36]\tvalid_0's ndcg@1: 0.94115\tvalid_0's ndcg@2: 0.976056\tvalid_0's ndcg@3: 0.977506\tvalid_0's ndcg@4: 0.977722\tvalid_0's ndcg@5: 0.97777\n",
+ "[37]\tvalid_0's ndcg@1: 0.9414\tvalid_0's ndcg@2: 0.976133\tvalid_0's ndcg@3: 0.977595\tvalid_0's ndcg@4: 0.977811\tvalid_0's ndcg@5: 0.977859\n",
+ "[38]\tvalid_0's ndcg@1: 0.94175\tvalid_0's ndcg@2: 0.976278\tvalid_0's ndcg@3: 0.977715\tvalid_0's ndcg@4: 0.977941\tvalid_0's ndcg@5: 0.97799\n",
+ "[39]\tvalid_0's ndcg@1: 0.942075\tvalid_0's ndcg@2: 0.976366\tvalid_0's ndcg@3: 0.977841\tvalid_0's ndcg@4: 0.978056\tvalid_0's ndcg@5: 0.978105\n",
+ "[40]\tvalid_0's ndcg@1: 0.94215\tvalid_0's ndcg@2: 0.976409\tvalid_0's ndcg@3: 0.977872\tvalid_0's ndcg@4: 0.978087\tvalid_0's ndcg@5: 0.978136\n",
+ "[41]\tvalid_0's ndcg@1: 0.94245\tvalid_0's ndcg@2: 0.97652\tvalid_0's ndcg@3: 0.977983\tvalid_0's ndcg@4: 0.978198\tvalid_0's ndcg@5: 0.978246\n",
+ "[42]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
+ "[43]\tvalid_0's ndcg@1: 0.942975\tvalid_0's ndcg@2: 0.976682\tvalid_0's ndcg@3: 0.97817\tvalid_0's ndcg@4: 0.978385\tvalid_0's ndcg@5: 0.978434\n",
+ "[44]\tvalid_0's ndcg@1: 0.94285\tvalid_0's ndcg@2: 0.976636\tvalid_0's ndcg@3: 0.978111\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978386\n",
+ "[45]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.9768\tvalid_0's ndcg@3: 0.978262\tvalid_0's ndcg@4: 0.978488\tvalid_0's ndcg@5: 0.978537\n",
+ "[46]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
+ "[47]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97836\tvalid_0's ndcg@4: 0.978576\tvalid_0's ndcg@5: 0.978634\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[48]\tvalid_0's ndcg@1: 0.943525\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.978373\tvalid_0's ndcg@4: 0.978577\tvalid_0's ndcg@5: 0.978636\n",
+ "[49]\tvalid_0's ndcg@1: 0.9436\tvalid_0's ndcg@2: 0.976913\tvalid_0's ndcg@3: 0.978388\tvalid_0's ndcg@4: 0.978614\tvalid_0's ndcg@5: 0.978663\n",
+ "[50]\tvalid_0's ndcg@1: 0.943975\tvalid_0's ndcg@2: 0.97702\tvalid_0's ndcg@3: 0.97852\tvalid_0's ndcg@4: 0.978746\tvalid_0's ndcg@5: 0.978794\n",
+ "[51]\tvalid_0's ndcg@1: 0.9441\tvalid_0's ndcg@2: 0.97705\tvalid_0's ndcg@3: 0.97855\tvalid_0's ndcg@4: 0.978787\tvalid_0's ndcg@5: 0.978836\n",
+ "[52]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.977121\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978846\tvalid_0's ndcg@5: 0.978894\n",
+ "[53]\tvalid_0's ndcg@1: 0.944225\tvalid_0's ndcg@2: 0.977081\tvalid_0's ndcg@3: 0.978618\tvalid_0's ndcg@4: 0.978834\tvalid_0's ndcg@5: 0.978882\n",
+ "[54]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977071\tvalid_0's ndcg@3: 0.978609\tvalid_0's ndcg@4: 0.978824\tvalid_0's ndcg@5: 0.978873\n",
+ "[55]\tvalid_0's ndcg@1: 0.94435\tvalid_0's ndcg@2: 0.977143\tvalid_0's ndcg@3: 0.978668\tvalid_0's ndcg@4: 0.978883\tvalid_0's ndcg@5: 0.978931\n",
+ "[56]\tvalid_0's ndcg@1: 0.9444\tvalid_0's ndcg@2: 0.977177\tvalid_0's ndcg@3: 0.978702\tvalid_0's ndcg@4: 0.978906\tvalid_0's ndcg@5: 0.978955\n",
+ "[57]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.977263\tvalid_0's ndcg@3: 0.978788\tvalid_0's ndcg@4: 0.979003\tvalid_0's ndcg@5: 0.979051\n",
+ "[58]\tvalid_0's ndcg@1: 0.9448\tvalid_0's ndcg@2: 0.977293\tvalid_0's ndcg@3: 0.978843\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979096\n",
+ "[59]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977472\tvalid_0's ndcg@3: 0.978997\tvalid_0's ndcg@4: 0.979202\tvalid_0's ndcg@5: 0.97925\n",
+ "[60]\tvalid_0's ndcg@1: 0.9455\tvalid_0's ndcg@2: 0.97763\tvalid_0's ndcg@3: 0.979118\tvalid_0's ndcg@4: 0.979322\tvalid_0's ndcg@5: 0.979371\n",
+ "[61]\tvalid_0's ndcg@1: 0.945725\tvalid_0's ndcg@2: 0.977682\tvalid_0's ndcg@3: 0.979194\tvalid_0's ndcg@4: 0.979399\tvalid_0's ndcg@5: 0.979447\n",
+ "[62]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977812\tvalid_0's ndcg@3: 0.979312\tvalid_0's ndcg@4: 0.979495\tvalid_0's ndcg@5: 0.979543\n",
+ "[63]\tvalid_0's ndcg@1: 0.946\tvalid_0's ndcg@2: 0.977878\tvalid_0's ndcg@3: 0.97934\tvalid_0's ndcg@4: 0.979523\tvalid_0's ndcg@5: 0.979572\n",
+ "[64]\tvalid_0's ndcg@1: 0.946525\tvalid_0's ndcg@2: 0.978056\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979714\tvalid_0's ndcg@5: 0.979762\n",
+ "[65]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.978105\tvalid_0's ndcg@3: 0.979592\tvalid_0's ndcg@4: 0.979775\tvalid_0's ndcg@5: 0.979823\n",
+ "[66]\tvalid_0's ndcg@1: 0.9465\tvalid_0's ndcg@2: 0.978046\tvalid_0's ndcg@3: 0.979534\tvalid_0's ndcg@4: 0.979706\tvalid_0's ndcg@5: 0.979755\n",
+ "[67]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.978127\tvalid_0's ndcg@3: 0.979614\tvalid_0's ndcg@4: 0.979776\tvalid_0's ndcg@5: 0.979824\n",
+ "[68]\tvalid_0's ndcg@1: 0.9467\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.979608\tvalid_0's ndcg@4: 0.97978\tvalid_0's ndcg@5: 0.979828\n",
+ "[69]\tvalid_0's ndcg@1: 0.946875\tvalid_0's ndcg@2: 0.978216\tvalid_0's ndcg@3: 0.979679\tvalid_0's ndcg@4: 0.979851\tvalid_0's ndcg@5: 0.9799\n",
+ "[70]\tvalid_0's ndcg@1: 0.9469\tvalid_0's ndcg@2: 0.978194\tvalid_0's ndcg@3: 0.979682\tvalid_0's ndcg@4: 0.979854\tvalid_0's ndcg@5: 0.979902\n",
+ "[71]\tvalid_0's ndcg@1: 0.947025\tvalid_0's ndcg@2: 0.978209\tvalid_0's ndcg@3: 0.979721\tvalid_0's ndcg@4: 0.979893\tvalid_0's ndcg@5: 0.979942\n",
+ "[72]\tvalid_0's ndcg@1: 0.9472\tvalid_0's ndcg@2: 0.978273\tvalid_0's ndcg@3: 0.979773\tvalid_0's ndcg@4: 0.979956\tvalid_0's ndcg@5: 0.980005\n",
+ "[73]\tvalid_0's ndcg@1: 0.947475\tvalid_0's ndcg@2: 0.978391\tvalid_0's ndcg@3: 0.979878\tvalid_0's ndcg@4: 0.980061\tvalid_0's ndcg@5: 0.980109\n",
+ "[74]\tvalid_0's ndcg@1: 0.94715\tvalid_0's ndcg@2: 0.978271\tvalid_0's ndcg@3: 0.979758\tvalid_0's ndcg@4: 0.979941\tvalid_0's ndcg@5: 0.97999\n",
+ "[75]\tvalid_0's ndcg@1: 0.947275\tvalid_0's ndcg@2: 0.978333\tvalid_0's ndcg@3: 0.979808\tvalid_0's ndcg@4: 0.979991\tvalid_0's ndcg@5: 0.980039\n",
+ "[76]\tvalid_0's ndcg@1: 0.9474\tvalid_0's ndcg@2: 0.97841\tvalid_0's ndcg@3: 0.979873\tvalid_0's ndcg@4: 0.980045\tvalid_0's ndcg@5: 0.980093\n",
+ "[77]\tvalid_0's ndcg@1: 0.94745\tvalid_0's ndcg@2: 0.97846\tvalid_0's ndcg@3: 0.979898\tvalid_0's ndcg@4: 0.98007\tvalid_0's ndcg@5: 0.980118\n",
+ "[78]\tvalid_0's ndcg@1: 0.94775\tvalid_0's ndcg@2: 0.978555\tvalid_0's ndcg@3: 0.980005\tvalid_0's ndcg@4: 0.980177\tvalid_0's ndcg@5: 0.980226\n",
+ "[79]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
+ "[80]\tvalid_0's ndcg@1: 0.947875\tvalid_0's ndcg@2: 0.978617\tvalid_0's ndcg@3: 0.980055\tvalid_0's ndcg@4: 0.980238\tvalid_0's ndcg@5: 0.980276\n",
+ "[81]\tvalid_0's ndcg@1: 0.948175\tvalid_0's ndcg@2: 0.978744\tvalid_0's ndcg@3: 0.980169\tvalid_0's ndcg@4: 0.980352\tvalid_0's ndcg@5: 0.98039\n",
+ "[82]\tvalid_0's ndcg@1: 0.948375\tvalid_0's ndcg@2: 0.97888\tvalid_0's ndcg@3: 0.980255\tvalid_0's ndcg@4: 0.980438\tvalid_0's ndcg@5: 0.980477\n",
+ "[83]\tvalid_0's ndcg@1: 0.94825\tvalid_0's ndcg@2: 0.978834\tvalid_0's ndcg@3: 0.980209\tvalid_0's ndcg@4: 0.980392\tvalid_0's ndcg@5: 0.980431\n",
+ "[84]\tvalid_0's ndcg@1: 0.948275\tvalid_0's ndcg@2: 0.978844\tvalid_0's ndcg@3: 0.980219\tvalid_0's ndcg@4: 0.980402\tvalid_0's ndcg@5: 0.98044\n",
+ "[85]\tvalid_0's ndcg@1: 0.948475\tvalid_0's ndcg@2: 0.978917\tvalid_0's ndcg@3: 0.980292\tvalid_0's ndcg@4: 0.980475\tvalid_0's ndcg@5: 0.980514\n",
+ "[86]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979102\tvalid_0's ndcg@3: 0.980477\tvalid_0's ndcg@4: 0.98066\tvalid_0's ndcg@5: 0.980699\n",
+ "[87]\tvalid_0's ndcg@1: 0.948975\tvalid_0's ndcg@2: 0.979086\tvalid_0's ndcg@3: 0.980474\tvalid_0's ndcg@4: 0.980657\tvalid_0's ndcg@5: 0.980695\n",
+ "[88]\tvalid_0's ndcg@1: 0.949025\tvalid_0's ndcg@2: 0.979136\tvalid_0's ndcg@3: 0.980499\tvalid_0's ndcg@4: 0.980682\tvalid_0's ndcg@5: 0.98072\n",
+ "[89]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979285\tvalid_0's ndcg@3: 0.98061\tvalid_0's ndcg@4: 0.980793\tvalid_0's ndcg@5: 0.980832\n",
+ "[90]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
+ "[91]\tvalid_0's ndcg@1: 0.9493\tvalid_0's ndcg@2: 0.979269\tvalid_0's ndcg@3: 0.980607\tvalid_0's ndcg@4: 0.98079\tvalid_0's ndcg@5: 0.980828\n",
+ "[92]\tvalid_0's ndcg@1: 0.9494\tvalid_0's ndcg@2: 0.97929\tvalid_0's ndcg@3: 0.98064\tvalid_0's ndcg@4: 0.980823\tvalid_0's ndcg@5: 0.980862\n",
+ "[93]\tvalid_0's ndcg@1: 0.949375\tvalid_0's ndcg@2: 0.979297\tvalid_0's ndcg@3: 0.980634\tvalid_0's ndcg@4: 0.980817\tvalid_0's ndcg@5: 0.980856\n",
+ "[94]\tvalid_0's ndcg@1: 0.949525\tvalid_0's ndcg@2: 0.979336\tvalid_0's ndcg@3: 0.980686\tvalid_0's ndcg@4: 0.980869\tvalid_0's ndcg@5: 0.980908\n",
+ "[95]\tvalid_0's ndcg@1: 0.949825\tvalid_0's ndcg@2: 0.979416\tvalid_0's ndcg@3: 0.980791\tvalid_0's ndcg@4: 0.980974\tvalid_0's ndcg@5: 0.981012\n",
+ "[96]\tvalid_0's ndcg@1: 0.94975\tvalid_0's ndcg@2: 0.979404\tvalid_0's ndcg@3: 0.980779\tvalid_0's ndcg@4: 0.980951\tvalid_0's ndcg@5: 0.98099\n",
+ "[97]\tvalid_0's ndcg@1: 0.950025\tvalid_0's ndcg@2: 0.979537\tvalid_0's ndcg@3: 0.980874\tvalid_0's ndcg@4: 0.981057\tvalid_0's ndcg@5: 0.981096\n",
+ "[98]\tvalid_0's ndcg@1: 0.9501\tvalid_0's ndcg@2: 0.979564\tvalid_0's ndcg@3: 0.980889\tvalid_0's ndcg@4: 0.981083\tvalid_0's ndcg@5: 0.981122\n",
+ "[99]\tvalid_0's ndcg@1: 0.950275\tvalid_0's ndcg@2: 0.979629\tvalid_0's ndcg@3: 0.980967\tvalid_0's ndcg@4: 0.98115\tvalid_0's ndcg@5: 0.981188\n",
+ "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.950325\tvalid_0's ndcg@2: 0.979647\tvalid_0's ndcg@3: 0.980985\tvalid_0's ndcg@4: 0.981168\tvalid_0's ndcg@5: 0.981207\n",
+ "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.96382\tvalid_0's ndcg@3: 0.965707\tvalid_0's ndcg@4: 0.966009\tvalid_0's ndcg@5: 0.966086\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.91415\tvalid_0's ndcg@2: 0.965492\tvalid_0's ndcg@3: 0.967254\tvalid_0's ndcg@4: 0.967556\tvalid_0's ndcg@5: 0.967604\n",
+ "[3]\tvalid_0's ndcg@1: 0.916025\tvalid_0's ndcg@2: 0.966389\tvalid_0's ndcg@3: 0.967976\tvalid_0's ndcg@4: 0.968278\tvalid_0's ndcg@5: 0.968355\n",
+ "[4]\tvalid_0's ndcg@1: 0.919\tvalid_0's ndcg@2: 0.967392\tvalid_0's ndcg@3: 0.96903\tvalid_0's ndcg@4: 0.969364\tvalid_0's ndcg@5: 0.969431\n",
+ "[5]\tvalid_0's ndcg@1: 0.921125\tvalid_0's ndcg@2: 0.968192\tvalid_0's ndcg@3: 0.969855\tvalid_0's ndcg@4: 0.970156\tvalid_0's ndcg@5: 0.970224\n",
+ "[6]\tvalid_0's ndcg@1: 0.921675\tvalid_0's ndcg@2: 0.968411\tvalid_0's ndcg@3: 0.970111\tvalid_0's ndcg@4: 0.97037\tvalid_0's ndcg@5: 0.970437\n",
+ "[7]\tvalid_0's ndcg@1: 0.9237\tvalid_0's ndcg@2: 0.969332\tvalid_0's ndcg@3: 0.970882\tvalid_0's ndcg@4: 0.97113\tvalid_0's ndcg@5: 0.971217\n",
+ "[8]\tvalid_0's ndcg@1: 0.925775\tvalid_0's ndcg@2: 0.970129\tvalid_0's ndcg@3: 0.971642\tvalid_0's ndcg@4: 0.971922\tvalid_0's ndcg@5: 0.97199\n",
+ "[9]\tvalid_0's ndcg@1: 0.926775\tvalid_0's ndcg@2: 0.970435\tvalid_0's ndcg@3: 0.971985\tvalid_0's ndcg@4: 0.972276\tvalid_0's ndcg@5: 0.972334\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[10]\tvalid_0's ndcg@1: 0.9277\tvalid_0's ndcg@2: 0.970761\tvalid_0's ndcg@3: 0.972311\tvalid_0's ndcg@4: 0.972612\tvalid_0's ndcg@5: 0.97267\n",
+ "[11]\tvalid_0's ndcg@1: 0.928975\tvalid_0's ndcg@2: 0.97131\tvalid_0's ndcg@3: 0.972798\tvalid_0's ndcg@4: 0.973089\tvalid_0's ndcg@5: 0.973166\n",
+ "[12]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971505\tvalid_0's ndcg@3: 0.972968\tvalid_0's ndcg@4: 0.973259\tvalid_0's ndcg@5: 0.973326\n",
+ "[13]\tvalid_0's ndcg@1: 0.929375\tvalid_0's ndcg@2: 0.971426\tvalid_0's ndcg@3: 0.972939\tvalid_0's ndcg@4: 0.97324\tvalid_0's ndcg@5: 0.973318\n",
+ "[14]\tvalid_0's ndcg@1: 0.929775\tvalid_0's ndcg@2: 0.971621\tvalid_0's ndcg@3: 0.973121\tvalid_0's ndcg@4: 0.973412\tvalid_0's ndcg@5: 0.97348\n",
+ "[15]\tvalid_0's ndcg@1: 0.9304\tvalid_0's ndcg@2: 0.971868\tvalid_0's ndcg@3: 0.97338\tvalid_0's ndcg@4: 0.97365\tvalid_0's ndcg@5: 0.973717\n",
+ "[16]\tvalid_0's ndcg@1: 0.930975\tvalid_0's ndcg@2: 0.972096\tvalid_0's ndcg@3: 0.973558\tvalid_0's ndcg@4: 0.973849\tvalid_0's ndcg@5: 0.973926\n",
+ "[17]\tvalid_0's ndcg@1: 0.93105\tvalid_0's ndcg@2: 0.972108\tvalid_0's ndcg@3: 0.973583\tvalid_0's ndcg@4: 0.973884\tvalid_0's ndcg@5: 0.973952\n",
+ "[18]\tvalid_0's ndcg@1: 0.931725\tvalid_0's ndcg@2: 0.972373\tvalid_0's ndcg@3: 0.97386\tvalid_0's ndcg@4: 0.974129\tvalid_0's ndcg@5: 0.974207\n",
+ "[19]\tvalid_0's ndcg@1: 0.932175\tvalid_0's ndcg@2: 0.972681\tvalid_0's ndcg@3: 0.974068\tvalid_0's ndcg@4: 0.974348\tvalid_0's ndcg@5: 0.974406\n",
+ "[20]\tvalid_0's ndcg@1: 0.93305\tvalid_0's ndcg@2: 0.973019\tvalid_0's ndcg@3: 0.974382\tvalid_0's ndcg@4: 0.974673\tvalid_0's ndcg@5: 0.974731\n",
+ "[21]\tvalid_0's ndcg@1: 0.933075\tvalid_0's ndcg@2: 0.97306\tvalid_0's ndcg@3: 0.974423\tvalid_0's ndcg@4: 0.974703\tvalid_0's ndcg@5: 0.97477\n",
+ "[22]\tvalid_0's ndcg@1: 0.93375\tvalid_0's ndcg@2: 0.973262\tvalid_0's ndcg@3: 0.974649\tvalid_0's ndcg@4: 0.974929\tvalid_0's ndcg@5: 0.975007\n",
+ "[23]\tvalid_0's ndcg@1: 0.933675\tvalid_0's ndcg@2: 0.973219\tvalid_0's ndcg@3: 0.974606\tvalid_0's ndcg@4: 0.974886\tvalid_0's ndcg@5: 0.974973\n",
+ "[24]\tvalid_0's ndcg@1: 0.934\tvalid_0's ndcg@2: 0.97337\tvalid_0's ndcg@3: 0.974745\tvalid_0's ndcg@4: 0.975014\tvalid_0's ndcg@5: 0.975101\n",
+ "[25]\tvalid_0's ndcg@1: 0.934825\tvalid_0's ndcg@2: 0.973674\tvalid_0's ndcg@3: 0.975062\tvalid_0's ndcg@4: 0.975342\tvalid_0's ndcg@5: 0.97541\n",
+ "[26]\tvalid_0's ndcg@1: 0.93495\tvalid_0's ndcg@2: 0.973721\tvalid_0's ndcg@3: 0.975096\tvalid_0's ndcg@4: 0.975365\tvalid_0's ndcg@5: 0.975452\n",
+ "[27]\tvalid_0's ndcg@1: 0.9358\tvalid_0's ndcg@2: 0.974082\tvalid_0's ndcg@3: 0.975444\tvalid_0's ndcg@4: 0.975713\tvalid_0's ndcg@5: 0.975781\n",
+ "[28]\tvalid_0's ndcg@1: 0.935325\tvalid_0's ndcg@2: 0.973875\tvalid_0's ndcg@3: 0.975275\tvalid_0's ndcg@4: 0.975512\tvalid_0's ndcg@5: 0.975599\n",
+ "[29]\tvalid_0's ndcg@1: 0.935925\tvalid_0's ndcg@2: 0.974159\tvalid_0's ndcg@3: 0.975522\tvalid_0's ndcg@4: 0.975759\tvalid_0's ndcg@5: 0.975836\n",
+ "[30]\tvalid_0's ndcg@1: 0.9362\tvalid_0's ndcg@2: 0.974214\tvalid_0's ndcg@3: 0.975589\tvalid_0's ndcg@4: 0.975847\tvalid_0's ndcg@5: 0.975924\n",
+ "[31]\tvalid_0's ndcg@1: 0.93625\tvalid_0's ndcg@2: 0.974216\tvalid_0's ndcg@3: 0.975629\tvalid_0's ndcg@4: 0.975876\tvalid_0's ndcg@5: 0.975944\n",
+ "[32]\tvalid_0's ndcg@1: 0.93665\tvalid_0's ndcg@2: 0.974427\tvalid_0's ndcg@3: 0.975814\tvalid_0's ndcg@4: 0.97603\tvalid_0's ndcg@5: 0.976107\n",
+ "[33]\tvalid_0's ndcg@1: 0.936775\tvalid_0's ndcg@2: 0.974505\tvalid_0's ndcg@3: 0.975855\tvalid_0's ndcg@4: 0.976081\tvalid_0's ndcg@5: 0.976158\n",
+ "[34]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974643\tvalid_0's ndcg@3: 0.975993\tvalid_0's ndcg@4: 0.976219\tvalid_0's ndcg@5: 0.976296\n",
+ "[35]\tvalid_0's ndcg@1: 0.937675\tvalid_0's ndcg@2: 0.974805\tvalid_0's ndcg@3: 0.97618\tvalid_0's ndcg@4: 0.976406\tvalid_0's ndcg@5: 0.976484\n",
+ "[36]\tvalid_0's ndcg@1: 0.9382\tvalid_0's ndcg@2: 0.974983\tvalid_0's ndcg@3: 0.976371\tvalid_0's ndcg@4: 0.976597\tvalid_0's ndcg@5: 0.976674\n",
+ "[37]\tvalid_0's ndcg@1: 0.938175\tvalid_0's ndcg@2: 0.974974\tvalid_0's ndcg@3: 0.976349\tvalid_0's ndcg@4: 0.976586\tvalid_0's ndcg@5: 0.976663\n",
+ "[38]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.975143\tvalid_0's ndcg@3: 0.976518\tvalid_0's ndcg@4: 0.976776\tvalid_0's ndcg@5: 0.976844\n",
+ "[39]\tvalid_0's ndcg@1: 0.938575\tvalid_0's ndcg@2: 0.975106\tvalid_0's ndcg@3: 0.976481\tvalid_0's ndcg@4: 0.976739\tvalid_0's ndcg@5: 0.976807\n",
+ "[40]\tvalid_0's ndcg@1: 0.938675\tvalid_0's ndcg@2: 0.97519\tvalid_0's ndcg@3: 0.976528\tvalid_0's ndcg@4: 0.976775\tvalid_0's ndcg@5: 0.976853\n",
+ "[41]\tvalid_0's ndcg@1: 0.9391\tvalid_0's ndcg@2: 0.975347\tvalid_0's ndcg@3: 0.976697\tvalid_0's ndcg@4: 0.976934\tvalid_0's ndcg@5: 0.977001\n",
+ "[42]\tvalid_0's ndcg@1: 0.939825\tvalid_0's ndcg@2: 0.975599\tvalid_0's ndcg@3: 0.976961\tvalid_0's ndcg@4: 0.977198\tvalid_0's ndcg@5: 0.977266\n",
+ "[43]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975639\tvalid_0's ndcg@3: 0.976977\tvalid_0's ndcg@4: 0.977214\tvalid_0's ndcg@5: 0.977282\n",
+ "[44]\tvalid_0's ndcg@1: 0.9398\tvalid_0's ndcg@2: 0.975605\tvalid_0's ndcg@3: 0.976955\tvalid_0's ndcg@4: 0.977192\tvalid_0's ndcg@5: 0.97726\n",
+ "[45]\tvalid_0's ndcg@1: 0.9401\tvalid_0's ndcg@2: 0.9757\tvalid_0's ndcg@3: 0.977075\tvalid_0's ndcg@4: 0.977291\tvalid_0's ndcg@5: 0.977368\n",
+ "[46]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975845\tvalid_0's ndcg@3: 0.977183\tvalid_0's ndcg@4: 0.97742\tvalid_0's ndcg@5: 0.977497\n",
+ "[47]\tvalid_0's ndcg@1: 0.940475\tvalid_0's ndcg@2: 0.975854\tvalid_0's ndcg@3: 0.977204\tvalid_0's ndcg@4: 0.97743\tvalid_0's ndcg@5: 0.977508\n",
+ "[48]\tvalid_0's ndcg@1: 0.940575\tvalid_0's ndcg@2: 0.975923\tvalid_0's ndcg@3: 0.977273\tvalid_0's ndcg@4: 0.977488\tvalid_0's ndcg@5: 0.977556\n",
+ "[49]\tvalid_0's ndcg@1: 0.9407\tvalid_0's ndcg@2: 0.975922\tvalid_0's ndcg@3: 0.977297\tvalid_0's ndcg@4: 0.977501\tvalid_0's ndcg@5: 0.977588\n",
+ "[50]\tvalid_0's ndcg@1: 0.940725\tvalid_0's ndcg@2: 0.975947\tvalid_0's ndcg@3: 0.977322\tvalid_0's ndcg@4: 0.977505\tvalid_0's ndcg@5: 0.977592\n",
+ "[51]\tvalid_0's ndcg@1: 0.9406\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.97725\tvalid_0's ndcg@4: 0.977422\tvalid_0's ndcg@5: 0.977509\n",
+ "[52]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975997\tvalid_0's ndcg@3: 0.977422\tvalid_0's ndcg@4: 0.977594\tvalid_0's ndcg@5: 0.977691\n",
+ "[53]\tvalid_0's ndcg@1: 0.940925\tvalid_0's ndcg@2: 0.975989\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977538\tvalid_0's ndcg@5: 0.977644\n",
+ "[54]\tvalid_0's ndcg@1: 0.94125\tvalid_0's ndcg@2: 0.976062\tvalid_0's ndcg@3: 0.977487\tvalid_0's ndcg@4: 0.977659\tvalid_0's ndcg@5: 0.977756\n",
+ "[55]\tvalid_0's ndcg@1: 0.94145\tvalid_0's ndcg@2: 0.976183\tvalid_0's ndcg@3: 0.97757\tvalid_0's ndcg@4: 0.977742\tvalid_0's ndcg@5: 0.977839\n",
+ "[56]\tvalid_0's ndcg@1: 0.941475\tvalid_0's ndcg@2: 0.976176\tvalid_0's ndcg@3: 0.977576\tvalid_0's ndcg@4: 0.977748\tvalid_0's ndcg@5: 0.977845\n",
+ "[57]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.976139\tvalid_0's ndcg@3: 0.977539\tvalid_0's ndcg@4: 0.977712\tvalid_0's ndcg@5: 0.977808\n",
+ "[58]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.97625\tvalid_0's ndcg@3: 0.97765\tvalid_0's ndcg@4: 0.977822\tvalid_0's ndcg@5: 0.977919\n",
+ "[59]\tvalid_0's ndcg@1: 0.941725\tvalid_0's ndcg@2: 0.976253\tvalid_0's ndcg@3: 0.977653\tvalid_0's ndcg@4: 0.977836\tvalid_0's ndcg@5: 0.977932\n",
+ "[60]\tvalid_0's ndcg@1: 0.941675\tvalid_0's ndcg@2: 0.976234\tvalid_0's ndcg@3: 0.977634\tvalid_0's ndcg@4: 0.977817\tvalid_0's ndcg@5: 0.977914\n",
+ "[61]\tvalid_0's ndcg@1: 0.9419\tvalid_0's ndcg@2: 0.976333\tvalid_0's ndcg@3: 0.977745\tvalid_0's ndcg@4: 0.977918\tvalid_0's ndcg@5: 0.978005\n",
+ "[62]\tvalid_0's ndcg@1: 0.941975\tvalid_0's ndcg@2: 0.976345\tvalid_0's ndcg@3: 0.977757\tvalid_0's ndcg@4: 0.97794\tvalid_0's ndcg@5: 0.978027\n",
+ "[63]\tvalid_0's ndcg@1: 0.9423\tvalid_0's ndcg@2: 0.976496\tvalid_0's ndcg@3: 0.977871\tvalid_0's ndcg@4: 0.978065\tvalid_0's ndcg@5: 0.978152\n",
+ "[64]\tvalid_0's ndcg@1: 0.942625\tvalid_0's ndcg@2: 0.976632\tvalid_0's ndcg@3: 0.977995\tvalid_0's ndcg@4: 0.978188\tvalid_0's ndcg@5: 0.978275\n",
+ "[65]\tvalid_0's ndcg@1: 0.942575\tvalid_0's ndcg@2: 0.976629\tvalid_0's ndcg@3: 0.977979\tvalid_0's ndcg@4: 0.978173\tvalid_0's ndcg@5: 0.97826\n",
+ "[66]\tvalid_0's ndcg@1: 0.942725\tvalid_0's ndcg@2: 0.976685\tvalid_0's ndcg@3: 0.978035\tvalid_0's ndcg@4: 0.978229\tvalid_0's ndcg@5: 0.978316\n",
+ "[67]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976678\tvalid_0's ndcg@3: 0.978041\tvalid_0's ndcg@4: 0.978224\tvalid_0's ndcg@5: 0.97832\n",
+ "[68]\tvalid_0's ndcg@1: 0.94275\tvalid_0's ndcg@2: 0.976694\tvalid_0's ndcg@3: 0.978044\tvalid_0's ndcg@4: 0.978227\tvalid_0's ndcg@5: 0.978324\n",
+ "[69]\tvalid_0's ndcg@1: 0.943\tvalid_0's ndcg@2: 0.976834\tvalid_0's ndcg@3: 0.978146\tvalid_0's ndcg@4: 0.978329\tvalid_0's ndcg@5: 0.978426\n",
+ "[70]\tvalid_0's ndcg@1: 0.943025\tvalid_0's ndcg@2: 0.976827\tvalid_0's ndcg@3: 0.978152\tvalid_0's ndcg@4: 0.978324\tvalid_0's ndcg@5: 0.978431\n",
+ "[71]\tvalid_0's ndcg@1: 0.9432\tvalid_0's ndcg@2: 0.976923\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978504\n",
+ "[72]\tvalid_0's ndcg@1: 0.943225\tvalid_0's ndcg@2: 0.976917\tvalid_0's ndcg@3: 0.978254\tvalid_0's ndcg@4: 0.978405\tvalid_0's ndcg@5: 0.978511\n",
+ "[73]\tvalid_0's ndcg@1: 0.94315\tvalid_0's ndcg@2: 0.976936\tvalid_0's ndcg@3: 0.978236\tvalid_0's ndcg@4: 0.978409\tvalid_0's ndcg@5: 0.978496\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[74]\tvalid_0's ndcg@1: 0.94325\tvalid_0's ndcg@2: 0.976957\tvalid_0's ndcg@3: 0.97827\tvalid_0's ndcg@4: 0.978431\tvalid_0's ndcg@5: 0.978528\n",
+ "[75]\tvalid_0's ndcg@1: 0.943075\tvalid_0's ndcg@2: 0.976861\tvalid_0's ndcg@3: 0.978199\tvalid_0's ndcg@4: 0.97836\tvalid_0's ndcg@5: 0.978457\n",
+ "[76]\tvalid_0's ndcg@1: 0.94335\tvalid_0's ndcg@2: 0.976963\tvalid_0's ndcg@3: 0.978288\tvalid_0's ndcg@4: 0.978471\tvalid_0's ndcg@5: 0.978568\n",
+ "[77]\tvalid_0's ndcg@1: 0.94345\tvalid_0's ndcg@2: 0.977031\tvalid_0's ndcg@3: 0.978331\tvalid_0's ndcg@4: 0.978514\tvalid_0's ndcg@5: 0.978611\n",
+ "[78]\tvalid_0's ndcg@1: 0.943475\tvalid_0's ndcg@2: 0.977088\tvalid_0's ndcg@3: 0.97835\tvalid_0's ndcg@4: 0.978533\tvalid_0's ndcg@5: 0.97863\n",
+ "[79]\tvalid_0's ndcg@1: 0.943625\tvalid_0's ndcg@2: 0.977096\tvalid_0's ndcg@3: 0.978396\tvalid_0's ndcg@4: 0.978579\tvalid_0's ndcg@5: 0.978676\n",
+ "[80]\tvalid_0's ndcg@1: 0.943825\tvalid_0's ndcg@2: 0.977154\tvalid_0's ndcg@3: 0.978479\tvalid_0's ndcg@4: 0.978651\tvalid_0's ndcg@5: 0.978748\n",
+ "[81]\tvalid_0's ndcg@1: 0.943775\tvalid_0's ndcg@2: 0.977135\tvalid_0's ndcg@3: 0.97846\tvalid_0's ndcg@4: 0.978633\tvalid_0's ndcg@5: 0.978729\n",
+ "[82]\tvalid_0's ndcg@1: 0.9443\tvalid_0's ndcg@2: 0.977361\tvalid_0's ndcg@3: 0.978673\tvalid_0's ndcg@4: 0.978845\tvalid_0's ndcg@5: 0.978933\n",
+ "[83]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.977324\tvalid_0's ndcg@3: 0.978624\tvalid_0's ndcg@4: 0.978796\tvalid_0's ndcg@5: 0.978893\n",
+ "[84]\tvalid_0's ndcg@1: 0.94405\tvalid_0's ndcg@2: 0.977253\tvalid_0's ndcg@3: 0.978565\tvalid_0's ndcg@4: 0.978737\tvalid_0's ndcg@5: 0.978834\n",
+ "[85]\tvalid_0's ndcg@1: 0.944175\tvalid_0's ndcg@2: 0.977283\tvalid_0's ndcg@3: 0.978633\tvalid_0's ndcg@4: 0.978795\tvalid_0's ndcg@5: 0.978882\n",
+ "[86]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.97745\tvalid_0's ndcg@3: 0.978763\tvalid_0's ndcg@4: 0.978924\tvalid_0's ndcg@5: 0.979011\n",
+ "[87]\tvalid_0's ndcg@1: 0.9445\tvalid_0's ndcg@2: 0.977419\tvalid_0's ndcg@3: 0.978756\tvalid_0's ndcg@4: 0.978918\tvalid_0's ndcg@5: 0.979005\n",
+ "[88]\tvalid_0's ndcg@1: 0.944825\tvalid_0's ndcg@2: 0.977554\tvalid_0's ndcg@3: 0.978867\tvalid_0's ndcg@4: 0.979039\tvalid_0's ndcg@5: 0.979126\n",
+ "[89]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977767\tvalid_0's ndcg@3: 0.979079\tvalid_0's ndcg@4: 0.979262\tvalid_0's ndcg@5: 0.97934\n",
+ "[90]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977773\tvalid_0's ndcg@3: 0.979073\tvalid_0's ndcg@4: 0.979256\tvalid_0's ndcg@5: 0.979334\n",
+ "[91]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977792\tvalid_0's ndcg@3: 0.979092\tvalid_0's ndcg@4: 0.979275\tvalid_0's ndcg@5: 0.979352\n",
+ "[92]\tvalid_0's ndcg@1: 0.945425\tvalid_0's ndcg@2: 0.977776\tvalid_0's ndcg@3: 0.979088\tvalid_0's ndcg@4: 0.979261\tvalid_0's ndcg@5: 0.979348\n",
+ "[93]\tvalid_0's ndcg@1: 0.945375\tvalid_0's ndcg@2: 0.977757\tvalid_0's ndcg@3: 0.979082\tvalid_0's ndcg@4: 0.979244\tvalid_0's ndcg@5: 0.979331\n",
+ "[94]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977761\tvalid_0's ndcg@3: 0.979061\tvalid_0's ndcg@4: 0.979223\tvalid_0's ndcg@5: 0.97931\n",
+ "[95]\tvalid_0's ndcg@1: 0.9454\tvalid_0's ndcg@2: 0.977798\tvalid_0's ndcg@3: 0.979086\tvalid_0's ndcg@4: 0.979258\tvalid_0's ndcg@5: 0.979345\n",
+ "[96]\tvalid_0's ndcg@1: 0.945825\tvalid_0's ndcg@2: 0.977955\tvalid_0's ndcg@3: 0.97923\tvalid_0's ndcg@4: 0.979413\tvalid_0's ndcg@5: 0.9795\n",
+ "[97]\tvalid_0's ndcg@1: 0.945925\tvalid_0's ndcg@2: 0.97796\tvalid_0's ndcg@3: 0.97926\tvalid_0's ndcg@4: 0.979443\tvalid_0's ndcg@5: 0.979531\n",
+ "[98]\tvalid_0's ndcg@1: 0.9464\tvalid_0's ndcg@2: 0.97812\tvalid_0's ndcg@3: 0.97942\tvalid_0's ndcg@4: 0.979625\tvalid_0's ndcg@5: 0.979702\n",
+ "[99]\tvalid_0's ndcg@1: 0.94655\tvalid_0's ndcg@2: 0.978191\tvalid_0's ndcg@3: 0.979479\tvalid_0's ndcg@4: 0.979683\tvalid_0's ndcg@5: 0.97977\n",
+ "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.978244\tvalid_0's ndcg@3: 0.979531\tvalid_0's ndcg@4: 0.979725\tvalid_0's ndcg@5: 0.979812\n",
+ "[1]\tvalid_0's ndcg@1: 0.910175\tvalid_0's ndcg@2: 0.963031\tvalid_0's ndcg@3: 0.965281\tvalid_0's ndcg@4: 0.965819\tvalid_0's ndcg@5: 0.965887\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's ndcg@1: 0.9141\tvalid_0's ndcg@2: 0.964748\tvalid_0's ndcg@3: 0.96681\tvalid_0's ndcg@4: 0.967316\tvalid_0's ndcg@5: 0.967394\n",
+ "[3]\tvalid_0's ndcg@1: 0.915925\tvalid_0's ndcg@2: 0.9655\tvalid_0's ndcg@3: 0.967575\tvalid_0's ndcg@4: 0.968028\tvalid_0's ndcg@5: 0.968105\n",
+ "[4]\tvalid_0's ndcg@1: 0.91915\tvalid_0's ndcg@2: 0.966943\tvalid_0's ndcg@3: 0.968968\tvalid_0's ndcg@4: 0.969334\tvalid_0's ndcg@5: 0.969373\n",
+ "[5]\tvalid_0's ndcg@1: 0.920625\tvalid_0's ndcg@2: 0.967598\tvalid_0's ndcg@3: 0.969498\tvalid_0's ndcg@4: 0.969896\tvalid_0's ndcg@5: 0.969944\n",
+ "[6]\tvalid_0's ndcg@1: 0.922625\tvalid_0's ndcg@2: 0.968336\tvalid_0's ndcg@3: 0.970261\tvalid_0's ndcg@4: 0.970659\tvalid_0's ndcg@5: 0.970688\n",
+ "[7]\tvalid_0's ndcg@1: 0.923625\tvalid_0's ndcg@2: 0.968768\tvalid_0's ndcg@3: 0.970656\tvalid_0's ndcg@4: 0.971043\tvalid_0's ndcg@5: 0.971072\n",
+ "[8]\tvalid_0's ndcg@1: 0.925825\tvalid_0's ndcg@2: 0.969612\tvalid_0's ndcg@3: 0.971462\tvalid_0's ndcg@4: 0.97186\tvalid_0's ndcg@5: 0.971879\n",
+ "[9]\tvalid_0's ndcg@1: 0.926475\tvalid_0's ndcg@2: 0.969899\tvalid_0's ndcg@3: 0.971711\tvalid_0's ndcg@4: 0.97211\tvalid_0's ndcg@5: 0.972129\n",
+ "[10]\tvalid_0's ndcg@1: 0.927775\tvalid_0's ndcg@2: 0.97041\tvalid_0's ndcg@3: 0.972185\tvalid_0's ndcg@4: 0.972594\tvalid_0's ndcg@5: 0.972614\n",
+ "[11]\tvalid_0's ndcg@1: 0.92885\tvalid_0's ndcg@2: 0.970838\tvalid_0's ndcg@3: 0.972588\tvalid_0's ndcg@4: 0.973008\tvalid_0's ndcg@5: 0.973028\n",
+ "[12]\tvalid_0's ndcg@1: 0.930325\tvalid_0's ndcg@2: 0.971367\tvalid_0's ndcg@3: 0.973129\tvalid_0's ndcg@4: 0.973549\tvalid_0's ndcg@5: 0.973569\n",
+ "[13]\tvalid_0's ndcg@1: 0.931125\tvalid_0's ndcg@2: 0.971631\tvalid_0's ndcg@3: 0.973443\tvalid_0's ndcg@4: 0.973842\tvalid_0's ndcg@5: 0.973871\n",
+ "[14]\tvalid_0's ndcg@1: 0.931525\tvalid_0's ndcg@2: 0.971778\tvalid_0's ndcg@3: 0.973616\tvalid_0's ndcg@4: 0.973993\tvalid_0's ndcg@5: 0.974022\n",
+ "[15]\tvalid_0's ndcg@1: 0.9311\tvalid_0's ndcg@2: 0.9717\tvalid_0's ndcg@3: 0.973475\tvalid_0's ndcg@4: 0.973852\tvalid_0's ndcg@5: 0.973872\n",
+ "[16]\tvalid_0's ndcg@1: 0.931775\tvalid_0's ndcg@2: 0.971902\tvalid_0's ndcg@3: 0.973702\tvalid_0's ndcg@4: 0.97409\tvalid_0's ndcg@5: 0.974109\n",
+ "[17]\tvalid_0's ndcg@1: 0.931425\tvalid_0's ndcg@2: 0.971805\tvalid_0's ndcg@3: 0.97358\tvalid_0's ndcg@4: 0.973967\tvalid_0's ndcg@5: 0.973986\n",
+ "[18]\tvalid_0's ndcg@1: 0.931575\tvalid_0's ndcg@2: 0.971876\tvalid_0's ndcg@3: 0.973651\tvalid_0's ndcg@4: 0.974027\tvalid_0's ndcg@5: 0.974047\n",
+ "[19]\tvalid_0's ndcg@1: 0.932\tvalid_0's ndcg@2: 0.97208\tvalid_0's ndcg@3: 0.973805\tvalid_0's ndcg@4: 0.974192\tvalid_0's ndcg@5: 0.974212\n",
+ "[20]\tvalid_0's ndcg@1: 0.932075\tvalid_0's ndcg@2: 0.972092\tvalid_0's ndcg@3: 0.973829\tvalid_0's ndcg@4: 0.974217\tvalid_0's ndcg@5: 0.974236\n",
+ "[21]\tvalid_0's ndcg@1: 0.932675\tvalid_0's ndcg@2: 0.972282\tvalid_0's ndcg@3: 0.974057\tvalid_0's ndcg@4: 0.974444\tvalid_0's ndcg@5: 0.974454\n",
+ "[22]\tvalid_0's ndcg@1: 0.932925\tvalid_0's ndcg@2: 0.972358\tvalid_0's ndcg@3: 0.974146\tvalid_0's ndcg@4: 0.974533\tvalid_0's ndcg@5: 0.974543\n",
+ "[23]\tvalid_0's ndcg@1: 0.93325\tvalid_0's ndcg@2: 0.972478\tvalid_0's ndcg@3: 0.974253\tvalid_0's ndcg@4: 0.974651\tvalid_0's ndcg@5: 0.974661\n",
+ "[24]\tvalid_0's ndcg@1: 0.9335\tvalid_0's ndcg@2: 0.972539\tvalid_0's ndcg@3: 0.974351\tvalid_0's ndcg@4: 0.974739\tvalid_0's ndcg@5: 0.974749\n",
+ "[25]\tvalid_0's ndcg@1: 0.93475\tvalid_0's ndcg@2: 0.973\tvalid_0's ndcg@3: 0.974788\tvalid_0's ndcg@4: 0.975197\tvalid_0's ndcg@5: 0.975206\n",
+ "[26]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.97312\tvalid_0's ndcg@3: 0.974895\tvalid_0's ndcg@4: 0.975315\tvalid_0's ndcg@5: 0.975325\n",
+ "[27]\tvalid_0's ndcg@1: 0.9349\tvalid_0's ndcg@2: 0.973103\tvalid_0's ndcg@3: 0.974865\tvalid_0's ndcg@4: 0.975264\tvalid_0's ndcg@5: 0.975273\n",
+ "[28]\tvalid_0's ndcg@1: 0.935075\tvalid_0's ndcg@2: 0.973152\tvalid_0's ndcg@3: 0.974939\tvalid_0's ndcg@4: 0.975327\tvalid_0's ndcg@5: 0.975336\n",
+ "[29]\tvalid_0's ndcg@1: 0.935475\tvalid_0's ndcg@2: 0.973315\tvalid_0's ndcg@3: 0.975128\tvalid_0's ndcg@4: 0.975483\tvalid_0's ndcg@5: 0.975492\n",
+ "[30]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973522\tvalid_0's ndcg@3: 0.975297\tvalid_0's ndcg@4: 0.975663\tvalid_0's ndcg@5: 0.975673\n",
+ "[31]\tvalid_0's ndcg@1: 0.93595\tvalid_0's ndcg@2: 0.973506\tvalid_0's ndcg@3: 0.975281\tvalid_0's ndcg@4: 0.975658\tvalid_0's ndcg@5: 0.975668\n",
+ "[32]\tvalid_0's ndcg@1: 0.93675\tvalid_0's ndcg@2: 0.973833\tvalid_0's ndcg@3: 0.975595\tvalid_0's ndcg@4: 0.975961\tvalid_0's ndcg@5: 0.975971\n",
+ "[33]\tvalid_0's ndcg@1: 0.936475\tvalid_0's ndcg@2: 0.973763\tvalid_0's ndcg@3: 0.975488\tvalid_0's ndcg@4: 0.975865\tvalid_0's ndcg@5: 0.975874\n",
+ "[34]\tvalid_0's ndcg@1: 0.9367\tvalid_0's ndcg@2: 0.973893\tvalid_0's ndcg@3: 0.975568\tvalid_0's ndcg@4: 0.975956\tvalid_0's ndcg@5: 0.975966\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[35]\tvalid_0's ndcg@1: 0.93715\tvalid_0's ndcg@2: 0.974059\tvalid_0's ndcg@3: 0.975722\tvalid_0's ndcg@4: 0.97612\tvalid_0's ndcg@5: 0.97613\n",
+ "[36]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
+ "[37]\tvalid_0's ndcg@1: 0.9374\tvalid_0's ndcg@2: 0.974183\tvalid_0's ndcg@3: 0.975846\tvalid_0's ndcg@4: 0.976223\tvalid_0's ndcg@5: 0.976232\n",
+ "[38]\tvalid_0's ndcg@1: 0.938725\tvalid_0's ndcg@2: 0.974672\tvalid_0's ndcg@3: 0.97636\tvalid_0's ndcg@4: 0.976715\tvalid_0's ndcg@5: 0.976725\n",
+ "[39]\tvalid_0's ndcg@1: 0.93865\tvalid_0's ndcg@2: 0.974676\tvalid_0's ndcg@3: 0.976364\tvalid_0's ndcg@4: 0.976697\tvalid_0's ndcg@5: 0.976707\n",
+ "[40]\tvalid_0's ndcg@1: 0.939125\tvalid_0's ndcg@2: 0.974867\tvalid_0's ndcg@3: 0.97653\tvalid_0's ndcg@4: 0.976874\tvalid_0's ndcg@5: 0.976884\n",
+ "[41]\tvalid_0's ndcg@1: 0.9396\tvalid_0's ndcg@2: 0.975042\tvalid_0's ndcg@3: 0.976705\tvalid_0's ndcg@4: 0.97705\tvalid_0's ndcg@5: 0.977059\n",
+ "[42]\tvalid_0's ndcg@1: 0.93985\tvalid_0's ndcg@2: 0.975072\tvalid_0's ndcg@3: 0.976784\tvalid_0's ndcg@4: 0.977129\tvalid_0's ndcg@5: 0.977138\n",
+ "[43]\tvalid_0's ndcg@1: 0.940075\tvalid_0's ndcg@2: 0.97517\tvalid_0's ndcg@3: 0.97687\tvalid_0's ndcg@4: 0.977215\tvalid_0's ndcg@5: 0.977225\n",
+ "[44]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.97534\tvalid_0's ndcg@3: 0.977015\tvalid_0's ndcg@4: 0.97736\tvalid_0's ndcg@5: 0.97737\n",
+ "[45]\tvalid_0's ndcg@1: 0.94055\tvalid_0's ndcg@2: 0.975409\tvalid_0's ndcg@3: 0.977059\tvalid_0's ndcg@4: 0.977403\tvalid_0's ndcg@5: 0.977413\n",
+ "[46]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975415\tvalid_0's ndcg@3: 0.97704\tvalid_0's ndcg@4: 0.977396\tvalid_0's ndcg@5: 0.977405\n",
+ "[47]\tvalid_0's ndcg@1: 0.940425\tvalid_0's ndcg@2: 0.975363\tvalid_0's ndcg@3: 0.977013\tvalid_0's ndcg@4: 0.977357\tvalid_0's ndcg@5: 0.977367\n",
+ "[48]\tvalid_0's ndcg@1: 0.94045\tvalid_0's ndcg@2: 0.975388\tvalid_0's ndcg@3: 0.977025\tvalid_0's ndcg@4: 0.97737\tvalid_0's ndcg@5: 0.977379\n",
+ "[49]\tvalid_0's ndcg@1: 0.940525\tvalid_0's ndcg@2: 0.975447\tvalid_0's ndcg@3: 0.977097\tvalid_0's ndcg@4: 0.977409\tvalid_0's ndcg@5: 0.977419\n",
+ "[50]\tvalid_0's ndcg@1: 0.941075\tvalid_0's ndcg@2: 0.975666\tvalid_0's ndcg@3: 0.977303\tvalid_0's ndcg@4: 0.977615\tvalid_0's ndcg@5: 0.977625\n",
+ "[51]\tvalid_0's ndcg@1: 0.94135\tvalid_0's ndcg@2: 0.975751\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.97771\tvalid_0's ndcg@5: 0.97772\n",
+ "[52]\tvalid_0's ndcg@1: 0.9413\tvalid_0's ndcg@2: 0.975717\tvalid_0's ndcg@3: 0.977355\tvalid_0's ndcg@4: 0.977688\tvalid_0's ndcg@5: 0.977698\n",
+ "[53]\tvalid_0's ndcg@1: 0.941375\tvalid_0's ndcg@2: 0.975713\tvalid_0's ndcg@3: 0.977376\tvalid_0's ndcg@4: 0.977699\tvalid_0's ndcg@5: 0.977718\n",
+ "[54]\tvalid_0's ndcg@1: 0.94185\tvalid_0's ndcg@2: 0.975857\tvalid_0's ndcg@3: 0.977557\tvalid_0's ndcg@4: 0.977869\tvalid_0's ndcg@5: 0.977889\n",
+ "[55]\tvalid_0's ndcg@1: 0.941925\tvalid_0's ndcg@2: 0.975837\tvalid_0's ndcg@3: 0.9776\tvalid_0's ndcg@4: 0.977891\tvalid_0's ndcg@5: 0.97791\n",
+ "[56]\tvalid_0's ndcg@1: 0.942325\tvalid_0's ndcg@2: 0.975969\tvalid_0's ndcg@3: 0.977719\tvalid_0's ndcg@4: 0.978032\tvalid_0's ndcg@5: 0.978051\n",
+ "[57]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976022\tvalid_0's ndcg@3: 0.977772\tvalid_0's ndcg@4: 0.978073\tvalid_0's ndcg@5: 0.978093\n",
+ "[58]\tvalid_0's ndcg@1: 0.9425\tvalid_0's ndcg@2: 0.976081\tvalid_0's ndcg@3: 0.977806\tvalid_0's ndcg@4: 0.978108\tvalid_0's ndcg@5: 0.978127\n",
+ "[59]\tvalid_0's ndcg@1: 0.9424\tvalid_0's ndcg@2: 0.976076\tvalid_0's ndcg@3: 0.977788\tvalid_0's ndcg@4: 0.978079\tvalid_0's ndcg@5: 0.978098\n",
+ "[60]\tvalid_0's ndcg@1: 0.942375\tvalid_0's ndcg@2: 0.976067\tvalid_0's ndcg@3: 0.977779\tvalid_0's ndcg@4: 0.97807\tvalid_0's ndcg@5: 0.978089\n",
+ "[61]\tvalid_0's ndcg@1: 0.942225\tvalid_0's ndcg@2: 0.976043\tvalid_0's ndcg@3: 0.97773\tvalid_0's ndcg@4: 0.978021\tvalid_0's ndcg@5: 0.97804\n",
+ "[62]\tvalid_0's ndcg@1: 0.942425\tvalid_0's ndcg@2: 0.976117\tvalid_0's ndcg@3: 0.977792\tvalid_0's ndcg@4: 0.978093\tvalid_0's ndcg@5: 0.978112\n",
+ "[63]\tvalid_0's ndcg@1: 0.942675\tvalid_0's ndcg@2: 0.976193\tvalid_0's ndcg@3: 0.977881\tvalid_0's ndcg@4: 0.978182\tvalid_0's ndcg@5: 0.978201\n",
+ "[64]\tvalid_0's ndcg@1: 0.942925\tvalid_0's ndcg@2: 0.976254\tvalid_0's ndcg@3: 0.977966\tvalid_0's ndcg@4: 0.978268\tvalid_0's ndcg@5: 0.978287\n",
+ "[65]\tvalid_0's ndcg@1: 0.9431\tvalid_0's ndcg@2: 0.97635\tvalid_0's ndcg@3: 0.978025\tvalid_0's ndcg@4: 0.978337\tvalid_0's ndcg@5: 0.978357\n",
+ "[66]\tvalid_0's ndcg@1: 0.9434\tvalid_0's ndcg@2: 0.976445\tvalid_0's ndcg@3: 0.978132\tvalid_0's ndcg@4: 0.978445\tvalid_0's ndcg@5: 0.978464\n",
+ "[67]\tvalid_0's ndcg@1: 0.943275\tvalid_0's ndcg@2: 0.976399\tvalid_0's ndcg@3: 0.978074\tvalid_0's ndcg@4: 0.978397\tvalid_0's ndcg@5: 0.978416\n",
+ "[68]\tvalid_0's ndcg@1: 0.943325\tvalid_0's ndcg@2: 0.976401\tvalid_0's ndcg@3: 0.978089\tvalid_0's ndcg@4: 0.978412\tvalid_0's ndcg@5: 0.978431\n",
+ "[69]\tvalid_0's ndcg@1: 0.943675\tvalid_0's ndcg@2: 0.976578\tvalid_0's ndcg@3: 0.97819\tvalid_0's ndcg@4: 0.978546\tvalid_0's ndcg@5: 0.978565\n",
+ "[70]\tvalid_0's ndcg@1: 0.944025\tvalid_0's ndcg@2: 0.976707\tvalid_0's ndcg@3: 0.97832\tvalid_0's ndcg@4: 0.978675\tvalid_0's ndcg@5: 0.978694\n",
+ "[71]\tvalid_0's ndcg@1: 0.9442\tvalid_0's ndcg@2: 0.976772\tvalid_0's ndcg@3: 0.978384\tvalid_0's ndcg@4: 0.97874\tvalid_0's ndcg@5: 0.978759\n",
+ "[72]\tvalid_0's ndcg@1: 0.94425\tvalid_0's ndcg@2: 0.976822\tvalid_0's ndcg@3: 0.978409\tvalid_0's ndcg@4: 0.978765\tvalid_0's ndcg@5: 0.978784\n",
+ "[73]\tvalid_0's ndcg@1: 0.94445\tvalid_0's ndcg@2: 0.976864\tvalid_0's ndcg@3: 0.978464\tvalid_0's ndcg@4: 0.97883\tvalid_0's ndcg@5: 0.978849\n",
+ "[74]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
+ "[75]\tvalid_0's ndcg@1: 0.9446\tvalid_0's ndcg@2: 0.976919\tvalid_0's ndcg@3: 0.978519\tvalid_0's ndcg@4: 0.978885\tvalid_0's ndcg@5: 0.978905\n",
+ "[76]\tvalid_0's ndcg@1: 0.944625\tvalid_0's ndcg@2: 0.97696\tvalid_0's ndcg@3: 0.978535\tvalid_0's ndcg@4: 0.978901\tvalid_0's ndcg@5: 0.978921\n",
+ "[77]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
+ "[78]\tvalid_0's ndcg@1: 0.944675\tvalid_0's ndcg@2: 0.976979\tvalid_0's ndcg@3: 0.978554\tvalid_0's ndcg@4: 0.97892\tvalid_0's ndcg@5: 0.978939\n",
+ "[79]\tvalid_0's ndcg@1: 0.944525\tvalid_0's ndcg@2: 0.976907\tvalid_0's ndcg@3: 0.978507\tvalid_0's ndcg@4: 0.978863\tvalid_0's ndcg@5: 0.978882\n",
+ "[80]\tvalid_0's ndcg@1: 0.94455\tvalid_0's ndcg@2: 0.976885\tvalid_0's ndcg@3: 0.97851\tvalid_0's ndcg@4: 0.978865\tvalid_0's ndcg@5: 0.978885\n",
+ "[81]\tvalid_0's ndcg@1: 0.944725\tvalid_0's ndcg@2: 0.97695\tvalid_0's ndcg@3: 0.978575\tvalid_0's ndcg@4: 0.978919\tvalid_0's ndcg@5: 0.978948\n",
+ "[82]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.977103\tvalid_0's ndcg@3: 0.978765\tvalid_0's ndcg@4: 0.97911\tvalid_0's ndcg@5: 0.979129\n",
+ "[83]\tvalid_0's ndcg@1: 0.945125\tvalid_0's ndcg@2: 0.977066\tvalid_0's ndcg@3: 0.978716\tvalid_0's ndcg@4: 0.979071\tvalid_0's ndcg@5: 0.97909\n",
+ "[84]\tvalid_0's ndcg@1: 0.945225\tvalid_0's ndcg@2: 0.97715\tvalid_0's ndcg@3: 0.978775\tvalid_0's ndcg@4: 0.97912\tvalid_0's ndcg@5: 0.979139\n",
+ "[85]\tvalid_0's ndcg@1: 0.945025\tvalid_0's ndcg@2: 0.977092\tvalid_0's ndcg@3: 0.978692\tvalid_0's ndcg@4: 0.979047\tvalid_0's ndcg@5: 0.979067\n",
+ "[86]\tvalid_0's ndcg@1: 0.9452\tvalid_0's ndcg@2: 0.977172\tvalid_0's ndcg@3: 0.97876\tvalid_0's ndcg@4: 0.979115\tvalid_0's ndcg@5: 0.979135\n",
+ "[87]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.97879\tvalid_0's ndcg@4: 0.979156\tvalid_0's ndcg@5: 0.979166\n",
+ "[88]\tvalid_0's ndcg@1: 0.9453\tvalid_0's ndcg@2: 0.977178\tvalid_0's ndcg@3: 0.978815\tvalid_0's ndcg@4: 0.979149\tvalid_0's ndcg@5: 0.979168\n",
+ "[89]\tvalid_0's ndcg@1: 0.94555\tvalid_0's ndcg@2: 0.977333\tvalid_0's ndcg@3: 0.978933\tvalid_0's ndcg@4: 0.979267\tvalid_0's ndcg@5: 0.979277\n",
+ "[90]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977462\tvalid_0's ndcg@3: 0.979062\tvalid_0's ndcg@4: 0.979396\tvalid_0's ndcg@5: 0.979406\n",
+ "[91]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977481\tvalid_0's ndcg@3: 0.979081\tvalid_0's ndcg@4: 0.979414\tvalid_0's ndcg@5: 0.979424\n",
+ "[92]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977437\tvalid_0's ndcg@3: 0.97905\tvalid_0's ndcg@4: 0.979384\tvalid_0's ndcg@5: 0.979393\n",
+ "[93]\tvalid_0's ndcg@1: 0.945875\tvalid_0's ndcg@2: 0.977421\tvalid_0's ndcg@3: 0.979046\tvalid_0's ndcg@4: 0.97938\tvalid_0's ndcg@5: 0.97939\n",
+ "[94]\tvalid_0's ndcg@1: 0.9459\tvalid_0's ndcg@2: 0.977431\tvalid_0's ndcg@3: 0.979068\tvalid_0's ndcg@4: 0.979391\tvalid_0's ndcg@5: 0.979401\n",
+ "[95]\tvalid_0's ndcg@1: 0.94595\tvalid_0's ndcg@2: 0.977449\tvalid_0's ndcg@3: 0.979074\tvalid_0's ndcg@4: 0.979408\tvalid_0's ndcg@5: 0.979418\n",
+ "[96]\tvalid_0's ndcg@1: 0.946075\tvalid_0's ndcg@2: 0.977527\tvalid_0's ndcg@3: 0.979127\tvalid_0's ndcg@4: 0.979461\tvalid_0's ndcg@5: 0.97947\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[97]\tvalid_0's ndcg@1: 0.946375\tvalid_0's ndcg@2: 0.977622\tvalid_0's ndcg@3: 0.979222\tvalid_0's ndcg@4: 0.979577\tvalid_0's ndcg@5: 0.979577\n",
+ "[98]\tvalid_0's ndcg@1: 0.946625\tvalid_0's ndcg@2: 0.977714\tvalid_0's ndcg@3: 0.979339\tvalid_0's ndcg@4: 0.979673\tvalid_0's ndcg@5: 0.979673\n",
+ "[99]\tvalid_0's ndcg@1: 0.94665\tvalid_0's ndcg@2: 0.977739\tvalid_0's ndcg@3: 0.979352\tvalid_0's ndcg@4: 0.979685\tvalid_0's ndcg@5: 0.979685\n",
+ "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's ndcg@1: 0.946675\tvalid_0's ndcg@2: 0.97778\tvalid_0's ndcg@3: 0.97938\tvalid_0's ndcg@4: 0.979703\tvalid_0's ndcg@5: 0.979703\n"
+ ]
+ }
+ ],
+ "source": [
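+    "# 5-fold cross-validation; the folds are split by user, so all rows of a user land in the same fold\n",
+    "# This part is independent of the single train/validation run above\n",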
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_rank_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id','label']]\n",
+ "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
+ "\n",
+    "# Run 5-fold cross-validation and keep the out-of-fold predictions for stacking\n",
+ "for n_fold, valid_user in enumerate(user_set):\n",
+    "    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy() # add slide user\n",
+    "    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()\n",
+ " \n",
+    "    # Per-user group sizes for the training and validation sets (rows are sorted by user_id to match)\n",
+ " train_idx.sort_values(by=['user_id'], inplace=True)\n",
+ " g_train = train_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ " \n",
+ " valid_idx.sort_values(by=['user_id'], inplace=True)\n",
+ " g_val = valid_idx.groupby(['user_id'], as_index=False).count()[\"label\"].values\n",
+ " \n",
+    "    # Define the model\n",
+ " lgb_ranker = lgb.LGBMRanker(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16) \n",
+    "    # Train the model\n",
+ " lgb_ranker.fit(train_idx[lgb_cols], train_idx['label'], group=g_train,\n",
+ " eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], eval_group= [g_val], \n",
+ " eval_at=[1, 2, 3, 4, 5], eval_metric=['ndcg', ], early_stopping_rounds=50, )\n",
+ " \n",
+    "    # Predict on the validation fold\n",
+ " valid_idx['pred_score'] = lgb_ranker.predict(valid_idx[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ " \n",
+    "    # Normalize the predicted scores\n",
+ " valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
+ " \n",
+    "    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
+ " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+    "    # Collect each fold's validation predictions in a list; they are concatenated after the loop\n",
+ " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+ " \n",
+    "    # For an online submission, accumulate the test-set predictions of every fold; they are averaged after the loop\n",
+    "    if not offline:\n",
+    "        sub_preds += lgb_ranker.predict(tst_user_item_feats_df_rank_model[lgb_cols], num_iteration=lgb_ranker.best_iteration_)\n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new features produced by cross-validation on the training set\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_ranker_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for later stacking; more features could be built here\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "tst_user_item_feats_df_rank_model = tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
+ "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new features produced by cross-validation on the test set\n",
+ "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_ranker_feats.csv', index=False)"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Epoch 1/2\n",
- "290964/290964 [==============================] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842\n",
- "Epoch 2/2\n",
- "290964/290964 [==============================] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478\n"
- ]
- }
- ],
- "source": [
- "# 模型训练\n",
- "if offline:\n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val) , batch_size=256)\n",
- "else:\n",
- " # 也可以使用上面的语句用自己采样出来的验证集\n",
- " # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)\n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 32,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:29:20.436591Z",
- "start_time": "2020-11-18T04:28:58.102057Z"
- }
- },
- "outputs": [
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:52.604397Z",
+ "start_time": "2020-11-18T04:22:43.253034Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank by predicted score and generate the submission file\n",
+    "# Submission from the single model\n",
+ "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']]\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_ranker')"
+ ]
+ },
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "500000/500000 [==============================] - 20s 39us/sample\n"
- ]
- }
- ],
- "source": [
- "# 模型预测\n",
- "tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)\n",
- "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 33,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:29:34.985535Z",
- "start_time": "2020-11-18T04:29:26.264531Z"
- }
- },
- "outputs": [],
- "source": [
- "# 预测结果重新排序, 及生成提交结果\n",
- "rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]\n",
- "submit(rank_results, topk=5, model_name='din')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-15T06:15:49.490705Z",
- "start_time": "2020-11-15T06:15:49.473794Z"
- }
- },
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "code",
- "execution_count": 34,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:38:53.760383Z",
- "start_time": "2020-11-18T04:29:51.737721Z"
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## LGB classification model"
+ ]
},
- "scrolled": true
- },
- "outputs": [
{
- "name": "stdout",
- "output_type": "stream",
- "text": [
- "Train on 232681 samples, validate on 58283 samples\n",
- "Epoch 1/2\n",
- "232681/232681 [==============================] - 44s 189us/sample - loss: 0.2864 - binary_crossentropy: 0.2846 - auc: 0.9008 - val_loss: 0.2830 - val_binary_crossentropy: 0.2813 - val_auc: 0.9072\n",
- "Epoch 2/2\n",
- "232681/232681 [==============================] - 44s 187us/sample - loss: 0.2832 - binary_crossentropy: 0.2816 - auc: 0.9034 - val_loss: 0.2846 - val_binary_crossentropy: 0.2830 - val_auc: 0.9053\n",
- "58283/58283 [==============================] - 2s 36us/sample\n",
- "500000/500000 [==============================] - 19s 37us/sample\n",
- "Train on 232798 samples, validate on 58166 samples\n",
- "Epoch 1/2\n",
- "232798/232798 [==============================] - 43s 184us/sample - loss: 0.2818 - binary_crossentropy: 0.2802 - auc: 0.9051 - val_loss: 0.2968 - val_binary_crossentropy: 0.2953 - val_auc: 0.9062\n",
- "Epoch 2/2\n",
- "232798/232798 [==============================] - 44s 187us/sample - loss: 0.2796 - binary_crossentropy: 0.2782 - auc: 0.9069 - val_loss: 0.2820 - val_binary_crossentropy: 0.2806 - val_auc: 0.9071\n",
- "58166/58166 [==============================] - 2s 38us/sample\n",
- "500000/500000 [==============================] - 18s 37us/sample\n",
- "Train on 232847 samples, validate on 58117 samples\n",
- "Epoch 1/2\n",
- "232847/232847 [==============================] - 43s 185us/sample - loss: 0.2786 - binary_crossentropy: 0.2773 - auc: 0.9080 - val_loss: 0.2761 - val_binary_crossentropy: 0.2749 - val_auc: 0.9113\n",
- "Epoch 2/2\n",
- "232847/232847 [==============================] - 39s 166us/sample - loss: 0.2766 - binary_crossentropy: 0.2754 - auc: 0.9097 - val_loss: 0.2872 - val_binary_crossentropy: 0.2862 - val_auc: 0.9090\n",
- "58117/58117 [==============================] - 2s 34us/sample\n",
- "500000/500000 [==============================] - 17s 33us/sample\n",
- "Train on 232716 samples, validate on 58248 samples\n",
- "Epoch 1/2\n",
- "232716/232716 [==============================] - 39s 169us/sample - loss: 0.2763 - binary_crossentropy: 0.2753 - auc: 0.9100 - val_loss: 0.2739 - val_binary_crossentropy: 0.2730 - val_auc: 0.9116\n",
- "Epoch 2/2\n",
- "232716/232716 [==============================] - 39s 168us/sample - loss: 0.2743 - binary_crossentropy: 0.2735 - auc: 0.9119 - val_loss: 0.2859 - val_binary_crossentropy: 0.2851 - val_auc: 0.9090\n",
- "58248/58248 [==============================] - 2s 35us/sample\n",
- "500000/500000 [==============================] - 17s 34us/sample\n",
- "Train on 232814 samples, validate on 58150 samples\n",
- "Epoch 1/2\n",
- "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2747 - binary_crossentropy: 0.2739 - auc: 0.9115 - val_loss: 0.2702 - val_binary_crossentropy: 0.2695 - val_auc: 0.9163\n",
- "Epoch 2/2\n",
- "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2725 - binary_crossentropy: 0.2719 - auc: 0.9132 - val_loss: 0.2751 - val_binary_crossentropy: 0.2745 - val_auc: 0.9151\n",
- "58150/58150 [==============================] - 2s 34us/sample\n",
- "500000/500000 [==============================] - 17s 34us/sample\n"
- ]
- }
- ],
- "source": [
- "# 五折交叉验证,这里的五折交叉是以用户为目标进行五折划分\n",
- "# 这一部分与前面的单独训练和验证是分开的\n",
- "def get_kfold_users(trn_df, n=5):\n",
- " user_ids = trn_df['user_id'].unique()\n",
- " user_set = [user_ids[i::n] for i in range(n)]\n",
- " return user_set\n",
- "\n",
- "k_fold = 5\n",
- "trn_df = trn_user_item_feats_df_din_model\n",
- "user_set = get_kfold_users(trn_df, n=k_fold)\n",
- "\n",
- "score_list = []\n",
- "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
- "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
- "\n",
- "dense_fea = [x for x in dense_fea if x != 'label']\n",
- "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- "\n",
- "# 五折交叉验证,并将中间结果保存用于staking\n",
- "for n_fold, valid_user in enumerate(user_set):\n",
- " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
- " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
- " \n",
- " # 准备训练数据\n",
- " x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- " y_trn = train_idx['label'].values\n",
- "\n",
- " # 准备验证数据\n",
- " x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea, \n",
- " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
- " y_val = valid_idx['label'].values\n",
- " \n",
- " history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val) , batch_size=256)\n",
- " \n",
- " # 预测验证集结果\n",
- " valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256) \n",
- " \n",
- " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
- " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- " \n",
- " # 将验证集的预测结果放到一个列表中,后面进行拼接\n",
- " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
- " \n",
- " # 如果是线上测试,需要计算每次交叉验证的结果相加,最后求平均\n",
- " if not offline:\n",
- " sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0] \n",
- " \n",
- "score_df_ = pd.concat(score_list, axis=0)\n",
- "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
- "# 保存训练集交叉验证产生的新特征\n",
- "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)\n",
- " \n",
- "# 测试集的预测结果,多次交叉验证求平均,将预测的score和对应的rank特征保存,可以用于后面的staking,这里还可以构造其他更多的特征\n",
- "tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold\n",
- "tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))\n",
- "tst_user_item_feats_df_din_model.sort_values(by=['user_id', 'pred_score'])\n",
- "tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
- "\n",
- "# 保存测试集交叉验证的新特征\n",
- "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_din_cls_feats.csv', index=False)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "metadata": {},
- "outputs": [],
- "source": []
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# 模型融合"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## 加权融合"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 35,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:27.351996Z",
- "start_time": "2020-11-18T04:44:26.561275Z"
- }
- },
- "outputs": [],
- "source": [
- "# 读取多个模型的排序结果文件\n",
- "lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')\n",
- "lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')\n",
- "din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')\n",
- "\n",
- "# 这里也可以换成交叉验证输出的测试结果进行加权融合"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 36,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:31.593981Z",
- "start_time": "2020-11-18T04:44:31.589439Z"
- }
- },
- "outputs": [],
- "source": [
- "rank_model = {'lgb_ranker': lgb_ranker, \n",
- " 'lgb_cls': lgb_cls, \n",
- " 'din_ranker': din_ranker}"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 37,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:36.135860Z",
- "start_time": "2020-11-18T04:44:36.130577Z"
- }
- },
- "outputs": [],
- "source": [
- "def get_ensumble_predict_topk(rank_model, topk=5):\n",
- " final_recall = rank_model['lgb_cls'].append(rank_model['din_ranker'])\n",
- " rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))\n",
- " \n",
- " final_recall = final_recall.append(rank_model['lgb_ranker'])\n",
- " final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()\n",
- " \n",
- " submit(final_recall, topk=topk, model_name='ensemble_fuse')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 38,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:51.659270Z",
- "start_time": "2020-11-18T04:44:40.445659Z"
- }
- },
- "outputs": [],
- "source": [
- "get_ensumble_predict_topk(rank_model)"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Stacking"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 39,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:44:58.025992Z",
- "start_time": "2020-11-18T04:44:56.146962Z"
- }
- },
- "outputs": [],
- "source": [
- "# Read the result files generated by cross-validation for multiple models\n",
- "# Training set\n",
- "trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')\n",
- "trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')\n",
- "trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')\n",
- "\n",
- "# Test set\n",
- "tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')\n",
- "tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')\n",
- "tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 40,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:07.701862Z",
- "start_time": "2020-11-18T04:45:07.644335Z"
- }
- },
- "outputs": [],
- "source": [
- "# Concatenate the features output by multiple models\n",
- "\n",
- "finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']]\n",
- "finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']]\n",
- "\n",
- "for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):\n",
- " for feat in [ 'pred_score', 'pred_rank']:\n",
- " col_name = feat + '_' + str(idx)\n",
- " finall_trn_ranker_feats[col_name] = trn_model[feat]\n",
- "\n",
- "for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):\n",
- " for feat in [ 'pred_score', 'pred_rank']:\n",
- " col_name = feat + '_' + str(idx)\n",
- " finall_tst_ranker_feats[col_name] = tst_model[feat]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 41,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:15.044242Z",
- "start_time": "2020-11-18T04:45:13.138252Z"
- }
- },
- "outputs": [],
- "source": [
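- "# Define a logistic regression model that fits the features produced by cross-validation and then predicts on the test set\n",
- "# Note: during cross-validation you can construct more features related to the output predictions to enrich the inputs of this simple model\n",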
- "from sklearn.linear_model import LogisticRegression\n",
- "\n",
- "feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']\n",
- "\n",
- "trn_x = finall_trn_ranker_feats[feat_cols]\n",
- "trn_y = finall_trn_ranker_feats['label']\n",
- "\n",
- "tst_x = finall_tst_ranker_feats[feat_cols]\n",
- "\n",
- "# Define the model\n",
- "lr = LogisticRegression()\n",
- "\n",
- "# Train the model\n",
- "lr.fit(trn_x, trn_y)\n",
- "\n",
- "# Predict with the model\n",
- "finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": 42,
- "metadata": {
- "ExecuteTime": {
- "end_time": "2020-11-18T04:45:29.018764Z",
- "start_time": "2020-11-18T04:45:19.423130Z"
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:22:58.259730Z",
+ "start_time": "2020-11-18T04:22:58.254297Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Define the model and its parameters\n",
+ "lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=500, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:11.258774Z",
+ "start_time": "2020-11-18T04:23:00.861936Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Train the model\n",
+ "if offline:\n",
+ " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'],\n",
+ " eval_set=[(val_user_item_feats_df_rank_model[lgb_cols], val_user_item_feats_df_rank_model['label'])], \n",
+ " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
+ "else:\n",
+ " lgb_Classfication.fit(trn_user_item_feats_df_rank_model[lgb_cols], trn_user_item_feats_df_rank_model['label'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:19.591396Z",
+ "start_time": "2020-11-18T04:23:13.813850Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Predict with the model\n",
+ "tst_user_item_feats_df['pred_score'] = lgb_Classfication.predict_proba(tst_user_item_feats_df[lgb_cols])[:,1]\n",
+ "\n",
+ "# Save a copy of these ranking results for the later model fusion\n",
+ "tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'lgb_cls_score.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:23:32.352931Z",
+ "start_time": "2020-11-18T04:23:22.346609Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = tst_user_item_feats_df[['user_id', 'click_article_id', 'pred_score']]\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_cls')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:11.241196Z",
+ "start_time": "2020-11-18T04:23:41.377394Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[1]\tvalid_0's auc: 0.764896\tvalid_0's binary_logloss: 0.522153\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.767857\tvalid_0's binary_logloss: 0.52057\n",
+ "[3]\tvalid_0's auc: 0.783096\tvalid_0's binary_logloss: 0.519584\n",
+ "[4]\tvalid_0's auc: 0.784354\tvalid_0's binary_logloss: 0.518485\n",
+ "[5]\tvalid_0's auc: 0.790554\tvalid_0's binary_logloss: 0.516886\n",
+ "[6]\tvalid_0's auc: 0.791954\tvalid_0's binary_logloss: 0.515334\n",
+ "[7]\tvalid_0's auc: 0.794257\tvalid_0's binary_logloss: 0.514032\n",
+ "[8]\tvalid_0's auc: 0.795222\tvalid_0's binary_logloss: 0.512516\n",
+ "[9]\tvalid_0's auc: 0.795417\tvalid_0's binary_logloss: 0.511671\n",
+ "[10]\tvalid_0's auc: 0.795913\tvalid_0's binary_logloss: 0.510226\n",
+ "[11]\tvalid_0's auc: 0.798222\tvalid_0's binary_logloss: 0.508858\n",
+ "[12]\tvalid_0's auc: 0.79825\tvalid_0's binary_logloss: 0.507928\n",
+ "[13]\tvalid_0's auc: 0.798842\tvalid_0's binary_logloss: 0.50708\n",
+ "[14]\tvalid_0's auc: 0.798935\tvalid_0's binary_logloss: 0.505752\n",
+ "[15]\tvalid_0's auc: 0.799543\tvalid_0's binary_logloss: 0.504388\n",
+ "[16]\tvalid_0's auc: 0.800844\tvalid_0's binary_logloss: 0.503126\n",
+ "[17]\tvalid_0's auc: 0.800855\tvalid_0's binary_logloss: 0.501809\n",
+ "[18]\tvalid_0's auc: 0.801653\tvalid_0's binary_logloss: 0.500676\n",
+ "[19]\tvalid_0's auc: 0.801518\tvalid_0's binary_logloss: 0.49987\n",
+ "[20]\tvalid_0's auc: 0.801662\tvalid_0's binary_logloss: 0.498625\n",
+ "[21]\tvalid_0's auc: 0.802093\tvalid_0's binary_logloss: 0.498113\n",
+ "[22]\tvalid_0's auc: 0.803071\tvalid_0's binary_logloss: 0.496933\n",
+ "[23]\tvalid_0's auc: 0.803222\tvalid_0's binary_logloss: 0.495864\n",
+ "[24]\tvalid_0's auc: 0.802927\tvalid_0's binary_logloss: 0.494691\n",
+ "[25]\tvalid_0's auc: 0.802581\tvalid_0's binary_logloss: 0.493543\n",
+ "[26]\tvalid_0's auc: 0.802965\tvalid_0's binary_logloss: 0.492444\n",
+ "[27]\tvalid_0's auc: 0.80298\tvalid_0's binary_logloss: 0.491336\n",
+ "[28]\tvalid_0's auc: 0.803226\tvalid_0's binary_logloss: 0.490275\n",
+ "[29]\tvalid_0's auc: 0.803436\tvalid_0's binary_logloss: 0.489126\n",
+ "[30]\tvalid_0's auc: 0.803796\tvalid_0's binary_logloss: 0.48802\n",
+ "[31]\tvalid_0's auc: 0.803601\tvalid_0's binary_logloss: 0.486988\n",
+ "[32]\tvalid_0's auc: 0.804416\tvalid_0's binary_logloss: 0.485972\n",
+ "[33]\tvalid_0's auc: 0.804529\tvalid_0's binary_logloss: 0.484939\n",
+ "[34]\tvalid_0's auc: 0.804534\tvalid_0's binary_logloss: 0.483927\n",
+ "[35]\tvalid_0's auc: 0.804819\tvalid_0's binary_logloss: 0.483271\n",
+ "[36]\tvalid_0's auc: 0.804774\tvalid_0's binary_logloss: 0.482273\n",
+ "[37]\tvalid_0's auc: 0.805237\tvalid_0's binary_logloss: 0.481639\n",
+ "[38]\tvalid_0's auc: 0.805546\tvalid_0's binary_logloss: 0.480959\n",
+ "[39]\tvalid_0's auc: 0.805598\tvalid_0's binary_logloss: 0.479955\n",
+ "[40]\tvalid_0's auc: 0.806011\tvalid_0's binary_logloss: 0.47903\n",
+ "[41]\tvalid_0's auc: 0.806664\tvalid_0's binary_logloss: 0.478439\n",
+ "[42]\tvalid_0's auc: 0.807021\tvalid_0's binary_logloss: 0.477798\n",
+ "[43]\tvalid_0's auc: 0.80726\tvalid_0's binary_logloss: 0.476829\n",
+ "[44]\tvalid_0's auc: 0.807157\tvalid_0's binary_logloss: 0.475976\n",
+ "[45]\tvalid_0's auc: 0.807788\tvalid_0's binary_logloss: 0.475056\n",
+ "[46]\tvalid_0's auc: 0.80805\tvalid_0's binary_logloss: 0.474446\n",
+ "[47]\tvalid_0's auc: 0.808097\tvalid_0's binary_logloss: 0.473576\n",
+ "[48]\tvalid_0's auc: 0.80815\tvalid_0's binary_logloss: 0.472676\n",
+ "[49]\tvalid_0's auc: 0.808304\tvalid_0's binary_logloss: 0.471918\n",
+ "[50]\tvalid_0's auc: 0.808749\tvalid_0's binary_logloss: 0.471481\n",
+ "[51]\tvalid_0's auc: 0.808972\tvalid_0's binary_logloss: 0.471104\n",
+ "[52]\tvalid_0's auc: 0.809326\tvalid_0's binary_logloss: 0.470289\n",
+ "[53]\tvalid_0's auc: 0.809472\tvalid_0's binary_logloss: 0.469508\n",
+ "[54]\tvalid_0's auc: 0.809505\tvalid_0's binary_logloss: 0.46869\n",
+ "[55]\tvalid_0's auc: 0.809594\tvalid_0's binary_logloss: 0.467885\n",
+ "[56]\tvalid_0's auc: 0.809847\tvalid_0's binary_logloss: 0.467356\n",
+ "[57]\tvalid_0's auc: 0.810262\tvalid_0's binary_logloss: 0.466531\n",
+ "[58]\tvalid_0's auc: 0.810407\tvalid_0's binary_logloss: 0.46573\n",
+ "[59]\tvalid_0's auc: 0.810618\tvalid_0's binary_logloss: 0.465205\n",
+ "[60]\tvalid_0's auc: 0.81066\tvalid_0's binary_logloss: 0.464435\n",
+ "[61]\tvalid_0's auc: 0.810638\tvalid_0's binary_logloss: 0.463721\n",
+ "[62]\tvalid_0's auc: 0.810658\tvalid_0's binary_logloss: 0.462982\n",
+ "[63]\tvalid_0's auc: 0.811106\tvalid_0's binary_logloss: 0.462246\n",
+ "[64]\tvalid_0's auc: 0.811313\tvalid_0's binary_logloss: 0.461748\n",
+ "[65]\tvalid_0's auc: 0.811351\tvalid_0's binary_logloss: 0.461038\n",
+ "[66]\tvalid_0's auc: 0.811433\tvalid_0's binary_logloss: 0.460323\n",
+ "[67]\tvalid_0's auc: 0.81158\tvalid_0's binary_logloss: 0.459662\n",
+ "[68]\tvalid_0's auc: 0.811561\tvalid_0's binary_logloss: 0.458988\n",
+ "[69]\tvalid_0's auc: 0.811748\tvalid_0's binary_logloss: 0.458592\n",
+ "[70]\tvalid_0's auc: 0.811919\tvalid_0's binary_logloss: 0.457934\n",
+ "[71]\tvalid_0's auc: 0.812073\tvalid_0's binary_logloss: 0.457508\n",
+ "[72]\tvalid_0's auc: 0.812273\tvalid_0's binary_logloss: 0.457038\n",
+ "[73]\tvalid_0's auc: 0.812561\tvalid_0's binary_logloss: 0.456439\n",
+ "[74]\tvalid_0's auc: 0.812633\tvalid_0's binary_logloss: 0.455789\n",
+ "[75]\tvalid_0's auc: 0.812757\tvalid_0's binary_logloss: 0.455173\n",
+ "[76]\tvalid_0's auc: 0.812923\tvalid_0's binary_logloss: 0.454533\n",
+ "[77]\tvalid_0's auc: 0.81295\tvalid_0's binary_logloss: 0.45392\n",
+ "[78]\tvalid_0's auc: 0.813073\tvalid_0's binary_logloss: 0.453517\n",
+ "[79]\tvalid_0's auc: 0.813202\tvalid_0's binary_logloss: 0.452932\n",
+ "[80]\tvalid_0's auc: 0.813611\tvalid_0's binary_logloss: 0.452285\n",
+ "[81]\tvalid_0's auc: 0.813769\tvalid_0's binary_logloss: 0.45191\n",
+ "[82]\tvalid_0's auc: 0.814468\tvalid_0's binary_logloss: 0.451455\n",
+ "[83]\tvalid_0's auc: 0.814656\tvalid_0's binary_logloss: 0.450885\n",
+ "[84]\tvalid_0's auc: 0.814755\tvalid_0's binary_logloss: 0.450308\n",
+ "[85]\tvalid_0's auc: 0.814824\tvalid_0's binary_logloss: 0.449739\n",
+ "[86]\tvalid_0's auc: 0.81499\tvalid_0's binary_logloss: 0.449348\n",
+ "[87]\tvalid_0's auc: 0.815232\tvalid_0's binary_logloss: 0.448759\n",
+ "[88]\tvalid_0's auc: 0.815452\tvalid_0's binary_logloss: 0.44823\n",
+ "[89]\tvalid_0's auc: 0.815593\tvalid_0's binary_logloss: 0.447861\n",
+ "[90]\tvalid_0's auc: 0.815591\tvalid_0's binary_logloss: 0.447323\n",
+ "[91]\tvalid_0's auc: 0.815672\tvalid_0's binary_logloss: 0.446796\n",
+ "[92]\tvalid_0's auc: 0.815875\tvalid_0's binary_logloss: 0.446472\n",
+ "[93]\tvalid_0's auc: 0.815984\tvalid_0's binary_logloss: 0.445961\n",
+ "[94]\tvalid_0's auc: 0.816026\tvalid_0's binary_logloss: 0.445439\n",
+ "[95]\tvalid_0's auc: 0.816172\tvalid_0's binary_logloss: 0.444909\n",
+ "[96]\tvalid_0's auc: 0.816321\tvalid_0's binary_logloss: 0.444413\n",
+ "[97]\tvalid_0's auc: 0.816751\tvalid_0's binary_logloss: 0.44405\n",
+ "[98]\tvalid_0's auc: 0.817226\tvalid_0's binary_logloss: 0.443626\n",
+ "[99]\tvalid_0's auc: 0.817286\tvalid_0's binary_logloss: 0.443136\n",
+ "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.817391\tvalid_0's binary_logloss: 0.442854\n",
+ "[1]\tvalid_0's auc: 0.771584\tvalid_0's binary_logloss: 0.527139\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.775446\tvalid_0's binary_logloss: 0.525462\n",
+ "[3]\tvalid_0's auc: 0.790092\tvalid_0's binary_logloss: 0.524461\n",
+ "[4]\tvalid_0's auc: 0.791432\tvalid_0's binary_logloss: 0.523322\n",
+ "[5]\tvalid_0's auc: 0.797482\tvalid_0's binary_logloss: 0.521614\n",
+ "[6]\tvalid_0's auc: 0.79893\tvalid_0's binary_logloss: 0.520007\n",
+ "[7]\tvalid_0's auc: 0.800753\tvalid_0's binary_logloss: 0.5187\n",
+ "[8]\tvalid_0's auc: 0.802197\tvalid_0's binary_logloss: 0.517125\n",
+ "[9]\tvalid_0's auc: 0.802828\tvalid_0's binary_logloss: 0.516269\n",
+ "[10]\tvalid_0's auc: 0.803496\tvalid_0's binary_logloss: 0.51474\n",
+ "[11]\tvalid_0's auc: 0.804972\tvalid_0's binary_logloss: 0.513321\n",
+ "[12]\tvalid_0's auc: 0.804995\tvalid_0's binary_logloss: 0.512334\n",
+ "[13]\tvalid_0's auc: 0.80525\tvalid_0's binary_logloss: 0.51151\n",
+ "[14]\tvalid_0's auc: 0.805026\tvalid_0's binary_logloss: 0.510149\n",
+ "[15]\tvalid_0's auc: 0.805622\tvalid_0's binary_logloss: 0.508708\n",
+ "[16]\tvalid_0's auc: 0.806974\tvalid_0's binary_logloss: 0.507384\n",
+ "[17]\tvalid_0's auc: 0.807045\tvalid_0's binary_logloss: 0.506017\n",
+ "[18]\tvalid_0's auc: 0.807265\tvalid_0's binary_logloss: 0.504853\n",
+ "[19]\tvalid_0's auc: 0.807126\tvalid_0's binary_logloss: 0.503972\n",
+ "[20]\tvalid_0's auc: 0.806948\tvalid_0's binary_logloss: 0.502693\n",
+ "[21]\tvalid_0's auc: 0.807315\tvalid_0's binary_logloss: 0.502166\n",
+ "[22]\tvalid_0's auc: 0.808067\tvalid_0's binary_logloss: 0.500948\n",
+ "[23]\tvalid_0's auc: 0.808226\tvalid_0's binary_logloss: 0.49987\n",
+ "[24]\tvalid_0's auc: 0.808268\tvalid_0's binary_logloss: 0.498623\n",
+ "[25]\tvalid_0's auc: 0.808569\tvalid_0's binary_logloss: 0.497389\n",
+ "[26]\tvalid_0's auc: 0.809069\tvalid_0's binary_logloss: 0.49624\n",
+ "[27]\tvalid_0's auc: 0.809312\tvalid_0's binary_logloss: 0.495095\n",
+ "[28]\tvalid_0's auc: 0.809549\tvalid_0's binary_logloss: 0.494012\n",
+ "[29]\tvalid_0's auc: 0.809944\tvalid_0's binary_logloss: 0.492834\n",
+ "[30]\tvalid_0's auc: 0.810047\tvalid_0's binary_logloss: 0.491735\n",
+ "[31]\tvalid_0's auc: 0.810086\tvalid_0's binary_logloss: 0.490633\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[32]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.489595\n",
+ "[33]\tvalid_0's auc: 0.810539\tvalid_0's binary_logloss: 0.488536\n",
+ "[34]\tvalid_0's auc: 0.810529\tvalid_0's binary_logloss: 0.487489\n",
+ "[35]\tvalid_0's auc: 0.810932\tvalid_0's binary_logloss: 0.486775\n",
+ "[36]\tvalid_0's auc: 0.810769\tvalid_0's binary_logloss: 0.48577\n",
+ "[37]\tvalid_0's auc: 0.811363\tvalid_0's binary_logloss: 0.485123\n",
+ "[38]\tvalid_0's auc: 0.811801\tvalid_0's binary_logloss: 0.484413\n",
+ "[39]\tvalid_0's auc: 0.811987\tvalid_0's binary_logloss: 0.483371\n",
+ "[40]\tvalid_0's auc: 0.812268\tvalid_0's binary_logloss: 0.482407\n",
+ "[41]\tvalid_0's auc: 0.813297\tvalid_0's binary_logloss: 0.481742\n",
+ "[42]\tvalid_0's auc: 0.813453\tvalid_0's binary_logloss: 0.481108\n",
+ "[43]\tvalid_0's auc: 0.813603\tvalid_0's binary_logloss: 0.480163\n",
+ "[44]\tvalid_0's auc: 0.813654\tvalid_0's binary_logloss: 0.479239\n",
+ "[45]\tvalid_0's auc: 0.814267\tvalid_0's binary_logloss: 0.478299\n",
+ "[46]\tvalid_0's auc: 0.81455\tvalid_0's binary_logloss: 0.477678\n",
+ "[47]\tvalid_0's auc: 0.81452\tvalid_0's binary_logloss: 0.476766\n",
+ "[48]\tvalid_0's auc: 0.814925\tvalid_0's binary_logloss: 0.475815\n",
+ "[49]\tvalid_0's auc: 0.814907\tvalid_0's binary_logloss: 0.47503\n",
+ "[50]\tvalid_0's auc: 0.815278\tvalid_0's binary_logloss: 0.474588\n",
+ "[51]\tvalid_0's auc: 0.815535\tvalid_0's binary_logloss: 0.474171\n",
+ "[52]\tvalid_0's auc: 0.815685\tvalid_0's binary_logloss: 0.473335\n",
+ "[53]\tvalid_0's auc: 0.815787\tvalid_0's binary_logloss: 0.472509\n",
+ "[54]\tvalid_0's auc: 0.815827\tvalid_0's binary_logloss: 0.471686\n",
+ "[55]\tvalid_0's auc: 0.815871\tvalid_0's binary_logloss: 0.470838\n",
+ "[56]\tvalid_0's auc: 0.816238\tvalid_0's binary_logloss: 0.470285\n",
+ "[57]\tvalid_0's auc: 0.816269\tvalid_0's binary_logloss: 0.469495\n",
+ "[58]\tvalid_0's auc: 0.816528\tvalid_0's binary_logloss: 0.468654\n",
+ "[59]\tvalid_0's auc: 0.816706\tvalid_0's binary_logloss: 0.468122\n",
+ "[60]\tvalid_0's auc: 0.816821\tvalid_0's binary_logloss: 0.467352\n",
+ "[61]\tvalid_0's auc: 0.816759\tvalid_0's binary_logloss: 0.466622\n",
+ "[62]\tvalid_0's auc: 0.81682\tvalid_0's binary_logloss: 0.465867\n",
+ "[63]\tvalid_0's auc: 0.817251\tvalid_0's binary_logloss: 0.465112\n",
+ "[64]\tvalid_0's auc: 0.817476\tvalid_0's binary_logloss: 0.464589\n",
+ "[65]\tvalid_0's auc: 0.817613\tvalid_0's binary_logloss: 0.463831\n",
+ "[66]\tvalid_0's auc: 0.817648\tvalid_0's binary_logloss: 0.463098\n",
+ "[67]\tvalid_0's auc: 0.817719\tvalid_0's binary_logloss: 0.462414\n",
+ "[68]\tvalid_0's auc: 0.817814\tvalid_0's binary_logloss: 0.461727\n",
+ "[69]\tvalid_0's auc: 0.817973\tvalid_0's binary_logloss: 0.461329\n",
+ "[70]\tvalid_0's auc: 0.818108\tvalid_0's binary_logloss: 0.460674\n",
+ "[71]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.460222\n",
+ "[72]\tvalid_0's auc: 0.818456\tvalid_0's binary_logloss: 0.45977\n",
+ "[73]\tvalid_0's auc: 0.818727\tvalid_0's binary_logloss: 0.459157\n",
+ "[74]\tvalid_0's auc: 0.818988\tvalid_0's binary_logloss: 0.458437\n",
+ "[75]\tvalid_0's auc: 0.819144\tvalid_0's binary_logloss: 0.457808\n",
+ "[76]\tvalid_0's auc: 0.819259\tvalid_0's binary_logloss: 0.457159\n",
+ "[77]\tvalid_0's auc: 0.819343\tvalid_0's binary_logloss: 0.456512\n",
+ "[78]\tvalid_0's auc: 0.81954\tvalid_0's binary_logloss: 0.456045\n",
+ "[79]\tvalid_0's auc: 0.819687\tvalid_0's binary_logloss: 0.455416\n",
+ "[80]\tvalid_0's auc: 0.819958\tvalid_0's binary_logloss: 0.454765\n",
+ "[81]\tvalid_0's auc: 0.820115\tvalid_0's binary_logloss: 0.45436\n",
+ "[82]\tvalid_0's auc: 0.820536\tvalid_0's binary_logloss: 0.453965\n",
+ "[83]\tvalid_0's auc: 0.820649\tvalid_0's binary_logloss: 0.453383\n",
+ "[84]\tvalid_0's auc: 0.820663\tvalid_0's binary_logloss: 0.452804\n",
+ "[85]\tvalid_0's auc: 0.820809\tvalid_0's binary_logloss: 0.452167\n",
+ "[86]\tvalid_0's auc: 0.821024\tvalid_0's binary_logloss: 0.451735\n",
+ "[87]\tvalid_0's auc: 0.821124\tvalid_0's binary_logloss: 0.451167\n",
+ "[88]\tvalid_0's auc: 0.821243\tvalid_0's binary_logloss: 0.45061\n",
+ "[89]\tvalid_0's auc: 0.821404\tvalid_0's binary_logloss: 0.450215\n",
+ "[90]\tvalid_0's auc: 0.821488\tvalid_0's binary_logloss: 0.449656\n",
+ "[91]\tvalid_0's auc: 0.821538\tvalid_0's binary_logloss: 0.449107\n",
+ "[92]\tvalid_0's auc: 0.82172\tvalid_0's binary_logloss: 0.448752\n",
+ "[93]\tvalid_0's auc: 0.821809\tvalid_0's binary_logloss: 0.448188\n",
+ "[94]\tvalid_0's auc: 0.82184\tvalid_0's binary_logloss: 0.447659\n",
+ "[95]\tvalid_0's auc: 0.821971\tvalid_0's binary_logloss: 0.447108\n",
+ "[96]\tvalid_0's auc: 0.822086\tvalid_0's binary_logloss: 0.446596\n",
+ "[97]\tvalid_0's auc: 0.82247\tvalid_0's binary_logloss: 0.446244\n",
+ "[98]\tvalid_0's auc: 0.822951\tvalid_0's binary_logloss: 0.445812\n",
+ "[99]\tvalid_0's auc: 0.822991\tvalid_0's binary_logloss: 0.445329\n",
+ "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.823174\tvalid_0's binary_logloss: 0.445037\n",
+ "[1]\tvalid_0's auc: 0.769525\tvalid_0's binary_logloss: 0.526256\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.775857\tvalid_0's binary_logloss: 0.524594\n",
+ "[3]\tvalid_0's auc: 0.785307\tvalid_0's binary_logloss: 0.523606\n",
+ "[4]\tvalid_0's auc: 0.786356\tvalid_0's binary_logloss: 0.522495\n",
+ "[5]\tvalid_0's auc: 0.793385\tvalid_0's binary_logloss: 0.520812\n",
+ "[6]\tvalid_0's auc: 0.794014\tvalid_0's binary_logloss: 0.519253\n",
+ "[7]\tvalid_0's auc: 0.795454\tvalid_0's binary_logloss: 0.517961\n",
+ "[8]\tvalid_0's auc: 0.79807\tvalid_0's binary_logloss: 0.516363\n",
+ "[9]\tvalid_0's auc: 0.798756\tvalid_0's binary_logloss: 0.51548\n",
+ "[10]\tvalid_0's auc: 0.798314\tvalid_0's binary_logloss: 0.514021\n",
+ "[11]\tvalid_0's auc: 0.799343\tvalid_0's binary_logloss: 0.512678\n",
+ "[12]\tvalid_0's auc: 0.799573\tvalid_0's binary_logloss: 0.511708\n",
+ "[13]\tvalid_0's auc: 0.799563\tvalid_0's binary_logloss: 0.510892\n",
+ "[14]\tvalid_0's auc: 0.800333\tvalid_0's binary_logloss: 0.509532\n",
+ "[15]\tvalid_0's auc: 0.800672\tvalid_0's binary_logloss: 0.508117\n",
+ "[16]\tvalid_0's auc: 0.801953\tvalid_0's binary_logloss: 0.506866\n",
+ "[17]\tvalid_0's auc: 0.802078\tvalid_0's binary_logloss: 0.5055\n",
+ "[18]\tvalid_0's auc: 0.802449\tvalid_0's binary_logloss: 0.504358\n",
+ "[19]\tvalid_0's auc: 0.802329\tvalid_0's binary_logloss: 0.503503\n",
+ "[20]\tvalid_0's auc: 0.802437\tvalid_0's binary_logloss: 0.502233\n",
+ "[21]\tvalid_0's auc: 0.802653\tvalid_0's binary_logloss: 0.50174\n",
+ "[22]\tvalid_0's auc: 0.803753\tvalid_0's binary_logloss: 0.50056\n",
+ "[23]\tvalid_0's auc: 0.803956\tvalid_0's binary_logloss: 0.499496\n",
+ "[24]\tvalid_0's auc: 0.804231\tvalid_0's binary_logloss: 0.498283\n",
+ "[25]\tvalid_0's auc: 0.804554\tvalid_0's binary_logloss: 0.497059\n",
+ "[26]\tvalid_0's auc: 0.805133\tvalid_0's binary_logloss: 0.495963\n",
+ "[27]\tvalid_0's auc: 0.805333\tvalid_0's binary_logloss: 0.494842\n",
+ "[28]\tvalid_0's auc: 0.805644\tvalid_0's binary_logloss: 0.493771\n",
+ "[29]\tvalid_0's auc: 0.806029\tvalid_0's binary_logloss: 0.492598\n",
+ "[30]\tvalid_0's auc: 0.806321\tvalid_0's binary_logloss: 0.491474\n",
+ "[31]\tvalid_0's auc: 0.806201\tvalid_0's binary_logloss: 0.490419\n",
+ "[32]\tvalid_0's auc: 0.806671\tvalid_0's binary_logloss: 0.489393\n",
+ "[33]\tvalid_0's auc: 0.806899\tvalid_0's binary_logloss: 0.488331\n",
+ "[34]\tvalid_0's auc: 0.807105\tvalid_0's binary_logloss: 0.487277\n",
+ "[35]\tvalid_0's auc: 0.807257\tvalid_0's binary_logloss: 0.486592\n",
+ "[36]\tvalid_0's auc: 0.80729\tvalid_0's binary_logloss: 0.485607\n",
+ "[37]\tvalid_0's auc: 0.807752\tvalid_0's binary_logloss: 0.484951\n",
+ "[38]\tvalid_0's auc: 0.808191\tvalid_0's binary_logloss: 0.484269\n",
+ "[39]\tvalid_0's auc: 0.808417\tvalid_0's binary_logloss: 0.483242\n",
+ "[40]\tvalid_0's auc: 0.808761\tvalid_0's binary_logloss: 0.482291\n",
+ "[41]\tvalid_0's auc: 0.80965\tvalid_0's binary_logloss: 0.48164\n",
+ "[42]\tvalid_0's auc: 0.810065\tvalid_0's binary_logloss: 0.480962\n",
+ "[43]\tvalid_0's auc: 0.810209\tvalid_0's binary_logloss: 0.479995\n",
+ "[44]\tvalid_0's auc: 0.810091\tvalid_0's binary_logloss: 0.479077\n",
+ "[45]\tvalid_0's auc: 0.810573\tvalid_0's binary_logloss: 0.478185\n",
+ "[46]\tvalid_0's auc: 0.810924\tvalid_0's binary_logloss: 0.477558\n",
+ "[47]\tvalid_0's auc: 0.810951\tvalid_0's binary_logloss: 0.476662\n",
+ "[48]\tvalid_0's auc: 0.811101\tvalid_0's binary_logloss: 0.475745\n",
+ "[49]\tvalid_0's auc: 0.811269\tvalid_0's binary_logloss: 0.474951\n",
+ "[50]\tvalid_0's auc: 0.81173\tvalid_0's binary_logloss: 0.474514\n",
+ "[51]\tvalid_0's auc: 0.811937\tvalid_0's binary_logloss: 0.474114\n",
+ "[52]\tvalid_0's auc: 0.812136\tvalid_0's binary_logloss: 0.473297\n",
+ "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.472497\n",
+ "[54]\tvalid_0's auc: 0.812121\tvalid_0's binary_logloss: 0.471696\n",
+ "[55]\tvalid_0's auc: 0.812164\tvalid_0's binary_logloss: 0.470905\n",
+ "[56]\tvalid_0's auc: 0.812462\tvalid_0's binary_logloss: 0.470384\n",
+ "[57]\tvalid_0's auc: 0.812613\tvalid_0's binary_logloss: 0.4696\n",
+ "[58]\tvalid_0's auc: 0.812615\tvalid_0's binary_logloss: 0.468778\n",
+ "[59]\tvalid_0's auc: 0.812842\tvalid_0's binary_logloss: 0.468211\n",
+ "[60]\tvalid_0's auc: 0.81312\tvalid_0's binary_logloss: 0.467385\n",
+ "[61]\tvalid_0's auc: 0.813039\tvalid_0's binary_logloss: 0.466632\n",
+ "[62]\tvalid_0's auc: 0.812942\tvalid_0's binary_logloss: 0.465933\n",
+ "[63]\tvalid_0's auc: 0.813274\tvalid_0's binary_logloss: 0.465214\n",
+ "[64]\tvalid_0's auc: 0.813572\tvalid_0's binary_logloss: 0.464692\n",
+ "[65]\tvalid_0's auc: 0.813594\tvalid_0's binary_logloss: 0.463925\n",
+ "[66]\tvalid_0's auc: 0.813719\tvalid_0's binary_logloss: 0.463177\n",
+ "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.462513\n",
+ "[68]\tvalid_0's auc: 0.813989\tvalid_0's binary_logloss: 0.461843\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[69]\tvalid_0's auc: 0.814218\tvalid_0's binary_logloss: 0.461443\n",
+ "[70]\tvalid_0's auc: 0.814334\tvalid_0's binary_logloss: 0.460775\n",
+ "[71]\tvalid_0's auc: 0.814493\tvalid_0's binary_logloss: 0.460332\n",
+ "[72]\tvalid_0's auc: 0.814663\tvalid_0's binary_logloss: 0.459867\n",
+ "[73]\tvalid_0's auc: 0.814856\tvalid_0's binary_logloss: 0.459266\n",
+ "[74]\tvalid_0's auc: 0.815017\tvalid_0's binary_logloss: 0.458585\n",
+ "[75]\tvalid_0's auc: 0.815186\tvalid_0's binary_logloss: 0.457958\n",
+ "[76]\tvalid_0's auc: 0.815374\tvalid_0's binary_logloss: 0.457316\n",
+ "[77]\tvalid_0's auc: 0.81554\tvalid_0's binary_logloss: 0.45665\n",
+ "[78]\tvalid_0's auc: 0.81569\tvalid_0's binary_logloss: 0.456217\n",
+ "[79]\tvalid_0's auc: 0.815861\tvalid_0's binary_logloss: 0.455615\n",
+ "[80]\tvalid_0's auc: 0.816443\tvalid_0's binary_logloss: 0.454895\n",
+ "[81]\tvalid_0's auc: 0.816659\tvalid_0's binary_logloss: 0.454503\n",
+ "[82]\tvalid_0's auc: 0.817017\tvalid_0's binary_logloss: 0.454149\n",
+ "[83]\tvalid_0's auc: 0.817162\tvalid_0's binary_logloss: 0.453578\n",
+ "[84]\tvalid_0's auc: 0.817274\tvalid_0's binary_logloss: 0.452984\n",
+ "[85]\tvalid_0's auc: 0.817283\tvalid_0's binary_logloss: 0.452416\n",
+ "[86]\tvalid_0's auc: 0.817339\tvalid_0's binary_logloss: 0.452022\n",
+ "[87]\tvalid_0's auc: 0.817494\tvalid_0's binary_logloss: 0.45146\n",
+ "[88]\tvalid_0's auc: 0.817594\tvalid_0's binary_logloss: 0.450926\n",
+ "[89]\tvalid_0's auc: 0.817771\tvalid_0's binary_logloss: 0.450553\n",
+ "[90]\tvalid_0's auc: 0.81789\tvalid_0's binary_logloss: 0.449985\n",
+ "[91]\tvalid_0's auc: 0.817931\tvalid_0's binary_logloss: 0.449439\n",
+ "[92]\tvalid_0's auc: 0.818138\tvalid_0's binary_logloss: 0.449094\n",
+ "[93]\tvalid_0's auc: 0.818334\tvalid_0's binary_logloss: 0.448527\n",
+ "[94]\tvalid_0's auc: 0.818426\tvalid_0's binary_logloss: 0.447989\n",
+ "[95]\tvalid_0's auc: 0.818676\tvalid_0's binary_logloss: 0.447407\n",
+ "[96]\tvalid_0's auc: 0.818852\tvalid_0's binary_logloss: 0.446884\n",
+ "[97]\tvalid_0's auc: 0.81945\tvalid_0's binary_logloss: 0.446455\n",
+ "[98]\tvalid_0's auc: 0.819861\tvalid_0's binary_logloss: 0.446045\n",
+ "[99]\tvalid_0's auc: 0.819943\tvalid_0's binary_logloss: 0.445543\n",
+ "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.820076\tvalid_0's binary_logloss: 0.445258\n",
+ "[1]\tvalid_0's auc: 0.770032\tvalid_0's binary_logloss: 0.527241\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.779881\tvalid_0's binary_logloss: 0.525545\n",
+ "[3]\tvalid_0's auc: 0.791308\tvalid_0's binary_logloss: 0.524508\n",
+ "[4]\tvalid_0's auc: 0.790788\tvalid_0's binary_logloss: 0.52341\n",
+ "[5]\tvalid_0's auc: 0.795645\tvalid_0's binary_logloss: 0.521753\n",
+ "[6]\tvalid_0's auc: 0.797745\tvalid_0's binary_logloss: 0.520131\n",
+ "[7]\tvalid_0's auc: 0.79931\tvalid_0's binary_logloss: 0.518872\n",
+ "[8]\tvalid_0's auc: 0.800014\tvalid_0's binary_logloss: 0.517353\n",
+ "[9]\tvalid_0's auc: 0.800549\tvalid_0's binary_logloss: 0.516487\n",
+ "[10]\tvalid_0's auc: 0.800261\tvalid_0's binary_logloss: 0.515039\n",
+ "[11]\tvalid_0's auc: 0.801261\tvalid_0's binary_logloss: 0.513695\n",
+ "[12]\tvalid_0's auc: 0.801062\tvalid_0's binary_logloss: 0.512735\n",
+ "[13]\tvalid_0's auc: 0.801155\tvalid_0's binary_logloss: 0.51192\n",
+ "[14]\tvalid_0's auc: 0.801315\tvalid_0's binary_logloss: 0.510559\n",
+ "[15]\tvalid_0's auc: 0.80185\tvalid_0's binary_logloss: 0.509147\n",
+ "[16]\tvalid_0's auc: 0.803029\tvalid_0's binary_logloss: 0.507914\n",
+ "[17]\tvalid_0's auc: 0.803035\tvalid_0's binary_logloss: 0.506583\n",
+ "[18]\tvalid_0's auc: 0.803433\tvalid_0's binary_logloss: 0.505441\n",
+ "[19]\tvalid_0's auc: 0.803717\tvalid_0's binary_logloss: 0.504599\n",
+ "[20]\tvalid_0's auc: 0.803819\tvalid_0's binary_logloss: 0.503327\n",
+ "[21]\tvalid_0's auc: 0.803923\tvalid_0's binary_logloss: 0.502782\n",
+ "[22]\tvalid_0's auc: 0.804939\tvalid_0's binary_logloss: 0.501596\n",
+ "[23]\tvalid_0's auc: 0.804707\tvalid_0's binary_logloss: 0.500572\n",
+ "[24]\tvalid_0's auc: 0.804632\tvalid_0's binary_logloss: 0.499367\n",
+ "[25]\tvalid_0's auc: 0.804756\tvalid_0's binary_logloss: 0.498161\n",
+ "[26]\tvalid_0's auc: 0.805067\tvalid_0's binary_logloss: 0.497061\n",
+ "[27]\tvalid_0's auc: 0.805119\tvalid_0's binary_logloss: 0.495933\n",
+ "[28]\tvalid_0's auc: 0.805304\tvalid_0's binary_logloss: 0.494849\n",
+ "[29]\tvalid_0's auc: 0.805688\tvalid_0's binary_logloss: 0.493677\n",
+ "[30]\tvalid_0's auc: 0.805822\tvalid_0's binary_logloss: 0.492594\n",
+ "[31]\tvalid_0's auc: 0.805869\tvalid_0's binary_logloss: 0.49152\n",
+ "[32]\tvalid_0's auc: 0.807267\tvalid_0's binary_logloss: 0.490435\n",
+ "[33]\tvalid_0's auc: 0.807301\tvalid_0's binary_logloss: 0.489392\n",
+ "[34]\tvalid_0's auc: 0.80736\tvalid_0's binary_logloss: 0.488325\n",
+ "[35]\tvalid_0's auc: 0.807706\tvalid_0's binary_logloss: 0.487654\n",
+ "[36]\tvalid_0's auc: 0.807758\tvalid_0's binary_logloss: 0.486651\n",
+ "[37]\tvalid_0's auc: 0.808051\tvalid_0's binary_logloss: 0.486012\n",
+ "[38]\tvalid_0's auc: 0.808429\tvalid_0's binary_logloss: 0.485355\n",
+ "[39]\tvalid_0's auc: 0.808663\tvalid_0's binary_logloss: 0.484327\n",
+ "[40]\tvalid_0's auc: 0.809007\tvalid_0's binary_logloss: 0.483386\n",
+ "[41]\tvalid_0's auc: 0.809781\tvalid_0's binary_logloss: 0.482745\n",
+ "[42]\tvalid_0's auc: 0.810071\tvalid_0's binary_logloss: 0.482124\n",
+ "[43]\tvalid_0's auc: 0.810383\tvalid_0's binary_logloss: 0.481154\n",
+ "[44]\tvalid_0's auc: 0.810446\tvalid_0's binary_logloss: 0.480243\n",
+ "[45]\tvalid_0's auc: 0.811148\tvalid_0's binary_logloss: 0.479261\n",
+ "[46]\tvalid_0's auc: 0.811245\tvalid_0's binary_logloss: 0.478687\n",
+ "[47]\tvalid_0's auc: 0.811214\tvalid_0's binary_logloss: 0.477812\n",
+ "[48]\tvalid_0's auc: 0.811408\tvalid_0's binary_logloss: 0.47689\n",
+ "[49]\tvalid_0's auc: 0.811486\tvalid_0's binary_logloss: 0.476132\n",
+ "[50]\tvalid_0's auc: 0.811806\tvalid_0's binary_logloss: 0.475718\n",
+ "[51]\tvalid_0's auc: 0.812017\tvalid_0's binary_logloss: 0.475342\n",
+ "[52]\tvalid_0's auc: 0.812255\tvalid_0's binary_logloss: 0.474505\n",
+ "[53]\tvalid_0's auc: 0.812249\tvalid_0's binary_logloss: 0.473707\n",
+ "[54]\tvalid_0's auc: 0.812235\tvalid_0's binary_logloss: 0.47289\n",
+ "[55]\tvalid_0's auc: 0.812233\tvalid_0's binary_logloss: 0.472091\n",
+ "[56]\tvalid_0's auc: 0.812492\tvalid_0's binary_logloss: 0.471563\n",
+ "[57]\tvalid_0's auc: 0.812579\tvalid_0's binary_logloss: 0.47077\n",
+ "[58]\tvalid_0's auc: 0.812598\tvalid_0's binary_logloss: 0.469992\n",
+ "[59]\tvalid_0's auc: 0.812885\tvalid_0's binary_logloss: 0.469458\n",
+ "[60]\tvalid_0's auc: 0.812995\tvalid_0's binary_logloss: 0.468676\n",
+ "[61]\tvalid_0's auc: 0.812961\tvalid_0's binary_logloss: 0.467939\n",
+ "[62]\tvalid_0's auc: 0.812919\tvalid_0's binary_logloss: 0.467232\n",
+ "[63]\tvalid_0's auc: 0.813291\tvalid_0's binary_logloss: 0.466491\n",
+ "[64]\tvalid_0's auc: 0.813702\tvalid_0's binary_logloss: 0.465945\n",
+ "[65]\tvalid_0's auc: 0.813803\tvalid_0's binary_logloss: 0.465197\n",
+ "[66]\tvalid_0's auc: 0.813851\tvalid_0's binary_logloss: 0.4645\n",
+ "[67]\tvalid_0's auc: 0.814011\tvalid_0's binary_logloss: 0.463814\n",
+ "[68]\tvalid_0's auc: 0.814027\tvalid_0's binary_logloss: 0.463113\n",
+ "[69]\tvalid_0's auc: 0.814138\tvalid_0's binary_logloss: 0.462727\n",
+ "[70]\tvalid_0's auc: 0.814365\tvalid_0's binary_logloss: 0.462077\n",
+ "[71]\tvalid_0's auc: 0.814432\tvalid_0's binary_logloss: 0.461655\n",
+ "[72]\tvalid_0's auc: 0.8146\tvalid_0's binary_logloss: 0.461194\n",
+ "[73]\tvalid_0's auc: 0.815324\tvalid_0's binary_logloss: 0.460477\n",
+ "[74]\tvalid_0's auc: 0.815411\tvalid_0's binary_logloss: 0.459805\n",
+ "[75]\tvalid_0's auc: 0.815548\tvalid_0's binary_logloss: 0.459189\n",
+ "[76]\tvalid_0's auc: 0.815625\tvalid_0's binary_logloss: 0.458525\n",
+ "[77]\tvalid_0's auc: 0.81562\tvalid_0's binary_logloss: 0.457905\n",
+ "[78]\tvalid_0's auc: 0.815786\tvalid_0's binary_logloss: 0.45747\n",
+ "[79]\tvalid_0's auc: 0.815834\tvalid_0's binary_logloss: 0.456884\n",
+ "[80]\tvalid_0's auc: 0.816475\tvalid_0's binary_logloss: 0.45617\n",
+ "[81]\tvalid_0's auc: 0.816677\tvalid_0's binary_logloss: 0.455787\n",
+ "[82]\tvalid_0's auc: 0.817255\tvalid_0's binary_logloss: 0.455358\n",
+ "[83]\tvalid_0's auc: 0.817383\tvalid_0's binary_logloss: 0.454775\n",
+ "[84]\tvalid_0's auc: 0.817509\tvalid_0's binary_logloss: 0.454176\n",
+ "[85]\tvalid_0's auc: 0.817572\tvalid_0's binary_logloss: 0.453609\n",
+ "[86]\tvalid_0's auc: 0.817721\tvalid_0's binary_logloss: 0.453213\n",
+ "[87]\tvalid_0's auc: 0.817992\tvalid_0's binary_logloss: 0.452586\n",
+ "[88]\tvalid_0's auc: 0.81808\tvalid_0's binary_logloss: 0.45204\n",
+ "[89]\tvalid_0's auc: 0.818202\tvalid_0's binary_logloss: 0.451643\n",
+ "[90]\tvalid_0's auc: 0.818336\tvalid_0's binary_logloss: 0.451081\n",
+ "[91]\tvalid_0's auc: 0.818347\tvalid_0's binary_logloss: 0.450531\n",
+ "[92]\tvalid_0's auc: 0.818558\tvalid_0's binary_logloss: 0.450179\n",
+ "[93]\tvalid_0's auc: 0.818743\tvalid_0's binary_logloss: 0.449647\n",
+ "[94]\tvalid_0's auc: 0.818789\tvalid_0's binary_logloss: 0.449133\n",
+ "[95]\tvalid_0's auc: 0.818849\tvalid_0's binary_logloss: 0.44862\n",
+ "[96]\tvalid_0's auc: 0.81913\tvalid_0's binary_logloss: 0.448072\n",
+ "[97]\tvalid_0's auc: 0.819526\tvalid_0's binary_logloss: 0.447713\n",
+ "[98]\tvalid_0's auc: 0.819971\tvalid_0's binary_logloss: 0.447296\n",
+ "[99]\tvalid_0's auc: 0.819972\tvalid_0's binary_logloss: 0.446814\n"
+ ]
+ },
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.820086\tvalid_0's binary_logloss: 0.446533\n",
+ "[1]\tvalid_0's auc: 0.768646\tvalid_0's binary_logloss: 0.527167\n",
+ "Training until validation scores don't improve for 50 rounds\n",
+ "[2]\tvalid_0's auc: 0.779902\tvalid_0's binary_logloss: 0.525481\n",
+ "[3]\tvalid_0's auc: 0.789868\tvalid_0's binary_logloss: 0.524485\n",
+ "[4]\tvalid_0's auc: 0.791895\tvalid_0's binary_logloss: 0.523382\n",
+ "[5]\tvalid_0's auc: 0.795453\tvalid_0's binary_logloss: 0.521759\n",
+ "[6]\tvalid_0's auc: 0.796672\tvalid_0's binary_logloss: 0.520166\n",
+ "[7]\tvalid_0's auc: 0.798023\tvalid_0's binary_logloss: 0.518857\n",
+ "[8]\tvalid_0's auc: 0.799331\tvalid_0's binary_logloss: 0.517297\n",
+ "[9]\tvalid_0's auc: 0.800181\tvalid_0's binary_logloss: 0.516416\n",
+ "[10]\tvalid_0's auc: 0.800373\tvalid_0's binary_logloss: 0.514967\n",
+ "[11]\tvalid_0's auc: 0.801087\tvalid_0's binary_logloss: 0.513631\n",
+ "[12]\tvalid_0's auc: 0.801122\tvalid_0's binary_logloss: 0.512658\n",
+ "[13]\tvalid_0's auc: 0.801043\tvalid_0's binary_logloss: 0.511833\n",
+ "[14]\tvalid_0's auc: 0.801238\tvalid_0's binary_logloss: 0.510461\n",
+ "[15]\tvalid_0's auc: 0.801847\tvalid_0's binary_logloss: 0.509034\n",
+ "[16]\tvalid_0's auc: 0.803139\tvalid_0's binary_logloss: 0.507759\n",
+ "[17]\tvalid_0's auc: 0.803577\tvalid_0's binary_logloss: 0.506361\n",
+ "[18]\tvalid_0's auc: 0.803834\tvalid_0's binary_logloss: 0.505229\n",
+ "[19]\tvalid_0's auc: 0.803943\tvalid_0's binary_logloss: 0.504371\n",
+ "[20]\tvalid_0's auc: 0.80415\tvalid_0's binary_logloss: 0.503102\n",
+ "[21]\tvalid_0's auc: 0.804446\tvalid_0's binary_logloss: 0.502564\n",
+ "[22]\tvalid_0's auc: 0.805163\tvalid_0's binary_logloss: 0.501396\n",
+ "[23]\tvalid_0's auc: 0.805323\tvalid_0's binary_logloss: 0.500327\n",
+ "[24]\tvalid_0's auc: 0.805314\tvalid_0's binary_logloss: 0.499123\n",
+ "[25]\tvalid_0's auc: 0.80535\tvalid_0's binary_logloss: 0.497927\n",
+ "[26]\tvalid_0's auc: 0.805864\tvalid_0's binary_logloss: 0.496834\n",
+ "[27]\tvalid_0's auc: 0.805919\tvalid_0's binary_logloss: 0.495667\n",
+ "[28]\tvalid_0's auc: 0.806272\tvalid_0's binary_logloss: 0.494606\n",
+ "[29]\tvalid_0's auc: 0.806599\tvalid_0's binary_logloss: 0.49343\n",
+ "[30]\tvalid_0's auc: 0.806932\tvalid_0's binary_logloss: 0.492303\n",
+ "[31]\tvalid_0's auc: 0.806656\tvalid_0's binary_logloss: 0.491249\n",
+ "[32]\tvalid_0's auc: 0.807436\tvalid_0's binary_logloss: 0.490188\n",
+ "[33]\tvalid_0's auc: 0.807629\tvalid_0's binary_logloss: 0.489117\n",
+ "[34]\tvalid_0's auc: 0.807501\tvalid_0's binary_logloss: 0.48808\n",
+ "[35]\tvalid_0's auc: 0.807885\tvalid_0's binary_logloss: 0.487383\n",
+ "[36]\tvalid_0's auc: 0.807921\tvalid_0's binary_logloss: 0.48636\n",
+ "[37]\tvalid_0's auc: 0.808267\tvalid_0's binary_logloss: 0.485724\n",
+ "[38]\tvalid_0's auc: 0.808563\tvalid_0's binary_logloss: 0.485076\n",
+ "[39]\tvalid_0's auc: 0.808813\tvalid_0's binary_logloss: 0.484039\n",
+ "[40]\tvalid_0's auc: 0.809023\tvalid_0's binary_logloss: 0.483091\n",
+ "[41]\tvalid_0's auc: 0.809782\tvalid_0's binary_logloss: 0.482441\n",
+ "[42]\tvalid_0's auc: 0.810135\tvalid_0's binary_logloss: 0.48179\n",
+ "[43]\tvalid_0's auc: 0.810219\tvalid_0's binary_logloss: 0.48082\n",
+ "[44]\tvalid_0's auc: 0.81031\tvalid_0's binary_logloss: 0.479906\n",
+ "[45]\tvalid_0's auc: 0.810514\tvalid_0's binary_logloss: 0.479024\n",
+ "[46]\tvalid_0's auc: 0.810566\tvalid_0's binary_logloss: 0.478437\n",
+ "[47]\tvalid_0's auc: 0.810611\tvalid_0's binary_logloss: 0.477529\n",
+ "[48]\tvalid_0's auc: 0.810781\tvalid_0's binary_logloss: 0.476637\n",
+ "[49]\tvalid_0's auc: 0.81089\tvalid_0's binary_logloss: 0.475883\n",
+ "[50]\tvalid_0's auc: 0.811266\tvalid_0's binary_logloss: 0.475459\n",
+ "[51]\tvalid_0's auc: 0.811402\tvalid_0's binary_logloss: 0.475078\n",
+ "[52]\tvalid_0's auc: 0.811765\tvalid_0's binary_logloss: 0.474246\n",
+ "[53]\tvalid_0's auc: 0.811891\tvalid_0's binary_logloss: 0.473452\n",
+ "[54]\tvalid_0's auc: 0.811868\tvalid_0's binary_logloss: 0.47263\n",
+ "[55]\tvalid_0's auc: 0.81192\tvalid_0's binary_logloss: 0.471804\n",
+ "[56]\tvalid_0's auc: 0.812272\tvalid_0's binary_logloss: 0.471275\n",
+ "[57]\tvalid_0's auc: 0.812639\tvalid_0's binary_logloss: 0.470396\n",
+ "[58]\tvalid_0's auc: 0.812764\tvalid_0's binary_logloss: 0.469597\n",
+ "[59]\tvalid_0's auc: 0.813084\tvalid_0's binary_logloss: 0.469049\n",
+ "[60]\tvalid_0's auc: 0.813342\tvalid_0's binary_logloss: 0.468244\n",
+ "[61]\tvalid_0's auc: 0.813302\tvalid_0's binary_logloss: 0.467499\n",
+ "[62]\tvalid_0's auc: 0.813221\tvalid_0's binary_logloss: 0.466758\n",
+ "[63]\tvalid_0's auc: 0.813697\tvalid_0's binary_logloss: 0.466017\n",
+ "[64]\tvalid_0's auc: 0.813985\tvalid_0's binary_logloss: 0.465501\n",
+ "[65]\tvalid_0's auc: 0.81416\tvalid_0's binary_logloss: 0.464725\n",
+ "[66]\tvalid_0's auc: 0.814227\tvalid_0's binary_logloss: 0.46398\n",
+ "[67]\tvalid_0's auc: 0.814397\tvalid_0's binary_logloss: 0.463309\n",
+ "[68]\tvalid_0's auc: 0.814426\tvalid_0's binary_logloss: 0.462627\n",
+ "[69]\tvalid_0's auc: 0.814593\tvalid_0's binary_logloss: 0.462244\n",
+ "[70]\tvalid_0's auc: 0.814789\tvalid_0's binary_logloss: 0.461571\n",
+ "[71]\tvalid_0's auc: 0.814889\tvalid_0's binary_logloss: 0.461144\n",
+ "[72]\tvalid_0's auc: 0.815078\tvalid_0's binary_logloss: 0.460684\n",
+ "[73]\tvalid_0's auc: 0.815439\tvalid_0's binary_logloss: 0.460063\n",
+ "[74]\tvalid_0's auc: 0.815511\tvalid_0's binary_logloss: 0.459386\n",
+ "[75]\tvalid_0's auc: 0.815574\tvalid_0's binary_logloss: 0.45877\n",
+ "[76]\tvalid_0's auc: 0.815634\tvalid_0's binary_logloss: 0.458128\n",
+ "[77]\tvalid_0's auc: 0.815618\tvalid_0's binary_logloss: 0.457495\n",
+ "[78]\tvalid_0's auc: 0.81582\tvalid_0's binary_logloss: 0.457057\n",
+ "[79]\tvalid_0's auc: 0.81594\tvalid_0's binary_logloss: 0.456475\n",
+ "[80]\tvalid_0's auc: 0.815961\tvalid_0's binary_logloss: 0.455885\n",
+ "[81]\tvalid_0's auc: 0.816153\tvalid_0's binary_logloss: 0.455511\n",
+ "[82]\tvalid_0's auc: 0.816433\tvalid_0's binary_logloss: 0.455186\n",
+ "[83]\tvalid_0's auc: 0.816546\tvalid_0's binary_logloss: 0.454625\n",
+ "[84]\tvalid_0's auc: 0.816586\tvalid_0's binary_logloss: 0.454039\n",
+ "[85]\tvalid_0's auc: 0.816584\tvalid_0's binary_logloss: 0.453482\n",
+ "[86]\tvalid_0's auc: 0.816881\tvalid_0's binary_logloss: 0.453048\n",
+ "[87]\tvalid_0's auc: 0.817029\tvalid_0's binary_logloss: 0.452485\n",
+ "[88]\tvalid_0's auc: 0.81707\tvalid_0's binary_logloss: 0.451941\n",
+ "[89]\tvalid_0's auc: 0.817298\tvalid_0's binary_logloss: 0.451544\n",
+ "[90]\tvalid_0's auc: 0.817343\tvalid_0's binary_logloss: 0.450975\n",
+ "[91]\tvalid_0's auc: 0.817357\tvalid_0's binary_logloss: 0.450422\n",
+ "[92]\tvalid_0's auc: 0.817592\tvalid_0's binary_logloss: 0.450109\n",
+ "[93]\tvalid_0's auc: 0.817729\tvalid_0's binary_logloss: 0.449542\n",
+ "[94]\tvalid_0's auc: 0.817834\tvalid_0's binary_logloss: 0.448982\n",
+ "[95]\tvalid_0's auc: 0.81809\tvalid_0's binary_logloss: 0.448398\n",
+ "[96]\tvalid_0's auc: 0.818269\tvalid_0's binary_logloss: 0.447908\n",
+ "[97]\tvalid_0's auc: 0.818682\tvalid_0's binary_logloss: 0.447547\n",
+ "[98]\tvalid_0's auc: 0.819015\tvalid_0's binary_logloss: 0.447165\n",
+ "[99]\tvalid_0's auc: 0.819016\tvalid_0's binary_logloss: 0.446669\n",
+ "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n",
+ "Did not meet early stopping. Best iteration is:\n",
+ "[100]\tvalid_0's auc: 0.819127\tvalid_0's binary_logloss: 0.446397\n"
+ ]
+ }
+ ],
+ "source": [
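+    "# 5-fold cross-validation; the folds are split by user, so all samples of one user fall into the same fold\n",
+    "# This part is independent of the single train/validation run above\n",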
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_rank_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
+ "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
+ "\n",
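+    "# 5-fold cross-validation; the out-of-fold predictions are saved for stacking\n",
+    "for n_fold, valid_user in enumerate(user_set):\n",
+    "    # .copy() yields independent frames, so adding columns below does not trigger SettingWithCopyWarning\n",
+    "    train_idx = trn_df[~trn_df['user_id'].isin(valid_user)].copy()\n",
+    "    valid_idx = trn_df[trn_df['user_id'].isin(valid_user)].copy()\n",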
+ " \n",
+    "    # Define the model and its parameters\n",
+ " lgb_Classfication = lgb.LGBMClassifier(boosting_type='gbdt', num_leaves=31, reg_alpha=0.0, reg_lambda=1,\n",
+ " max_depth=-1, n_estimators=100, subsample=0.7, colsample_bytree=0.7, subsample_freq=1,\n",
+ " learning_rate=0.01, min_child_weight=50, random_state=2018, n_jobs= 16, verbose=10) \n",
+    "    # Train the model\n",
+ " lgb_Classfication.fit(train_idx[lgb_cols], train_idx['label'],eval_set=[(valid_idx[lgb_cols], valid_idx['label'])], \n",
+ " eval_metric=['auc', ],early_stopping_rounds=50, )\n",
+ " \n",
+    "    # Predict on the validation fold\n",
+ " valid_idx['pred_score'] = lgb_Classfication.predict_proba(valid_idx[lgb_cols], \n",
+ " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
+ " \n",
+    "    # No normalization of the outputs is needed: a classifier already outputs a probability\n",
+ " # valid_idx['pred_score'] = valid_idx[['pred_score']].transform(lambda x: norm_sim(x))\n",
+ " \n",
+    "    valid_idx = valid_idx.sort_values(by=['user_id', 'pred_score'])  # sort_values returns a new frame, so keep the result\n",
+ " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+    "    # Collect the validation-fold predictions in a list; they are concatenated after the loop\n",
+ " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+ " \n",
+    "    # For an online submission, accumulate the test-set predictions of every fold; they are averaged after the loop\n",
+ " if not offline:\n",
+ " sub_preds += lgb_Classfication.predict_proba(tst_user_item_feats_df_rank_model[lgb_cols], \n",
+ " num_iteration=lgb_Classfication.best_iteration_)[:,1]\n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new features produced by cross-validation on the training set\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_lgb_cls_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for later stacking; more features could be built here\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = sub_preds / k_fold\n",
+ "tst_user_item_feats_df_rank_model['pred_score'] = tst_user_item_feats_df_rank_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "tst_user_item_feats_df_rank_model = tst_user_item_feats_df_rank_model.sort_values(by=['user_id', 'pred_score'])\n",
+ "tst_user_item_feats_df_rank_model['pred_rank'] = tst_user_item_feats_df_rank_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new cross-validation features for the test set\n",
+ "tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_lgb_cls_feats.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:23.074237Z",
+ "start_time": "2020-11-18T04:24:13.812284Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+    "rank_results = tst_user_item_feats_df_rank_model[['user_id', 'click_article_id', 'pred_score']].copy()\n",
+ "rank_results['click_article_id'] = rank_results['click_article_id'].astype(int)\n",
+ "submit(rank_results, topk=5, model_name='lgb_cls')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## DIN model"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Users' historical click lists\n",
+    "These lists are built for the DIN model used below."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:24:30.508213Z",
+ "start_time": "2020-11-18T04:24:27.426372Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "if offline:\n",
+ " all_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
+ "else:\n",
+ " trn_data = pd.read_csv('./data_raw/train_click_log.csv')\n",
+ " tst_data = pd.read_csv('./data_raw/testA_click_log.csv')\n",
+    "    all_data = pd.concat([trn_data, tst_data])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 22,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:25:28.082071Z",
+ "start_time": "2020-11-18T04:24:33.649524Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "hist_click = all_data.groupby('user_id')['click_article_id'].agg(list).reset_index()\n",
+ "his_behavior_df = pd.DataFrame()\n",
+ "his_behavior_df['user_id'] = hist_click['user_id']\n",
+ "his_behavior_df['hist_click_article_id'] = hist_click['click_article_id']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 23,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:25:52.925866Z",
+ "start_time": "2020-11-18T04:25:52.863922Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df_din_model = trn_user_item_feats_df.copy()\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_din_model = val_user_item_feats_df.copy()\n",
+ "else: \n",
+ " val_user_item_feats_df_din_model = None\n",
+ " \n",
+ "tst_user_item_feats_df_din_model = tst_user_item_feats_df.copy()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 24,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:00.070681Z",
+ "start_time": "2020-11-18T04:25:56.417197Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "trn_user_item_feats_df_din_model = trn_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
+ "\n",
+ "if offline:\n",
+ " val_user_item_feats_df_din_model = val_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')\n",
+ "else:\n",
+ " val_user_item_feats_df_din_model = None\n",
+ "\n",
+ "tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_behavior_df, on='user_id')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
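+    "### Introduction to the DIN model\n",
+    "Next we try the DIN model. DIN stands for Deep Interest Network, a model proposed by Alibaba in 2018 because earlier deep learning models could not express the diversity of a user's interests. It computes a representation vector of the user's interests from the relevance between **the given candidate ad** and **the user's historical behaviors**. Specifically, it introduces a local activation unit that soft-searches the relevant parts of the historical behaviors, and it uses a weighted sum to obtain the user's interest representation with respect to the candidate ad. Behaviors that are more relevant to the candidate ad receive higher activation weights and dominate the user's interest representation. Because this vector differs from ad to ad, the expressive power of the model improves considerably. The model therefore also suits this news recommendation task: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user clicked in the past. The model structure is shown below:\n",
+    "\n",
+    "![image-20201116201646983](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)\n",
+    "\n",
+    "\n",
+    "Here we simply call a package to use this model; the details of the model will be covered in the next session of the recommender system team-study series. The following describes how to use the model. The function prototype in deepctr is:\n",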
+ "> def DIN(dnn_feature_columns, history_feature_list, dnn_use_bn=False,\n",
+ "> dnn_hidden_units=(200, 80), dnn_activation='relu', att_hidden_size=(80, 40), att_activation=\"dice\",\n",
+ "> att_weight_normalization=False, l2_reg_dnn=0, l2_reg_embedding=1e-6, dnn_dropout=0, seed=1024,\n",
+ "> task='binary'):\n",
+ "> \n",
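+    "> * dnn_feature_columns: the feature columns, a list containing all features of the data\n",
+    "> * history_feature_list: the user's historical behavior columns, a list of the features that reflect the user's historical behavior\n",
+    "> * dnn_use_bn: whether to use BatchNormalization\n",
+    "> * dnn_hidden_units: the number of layers of the fully connected network and the number of units in each layer, given as a list or tuple\n",
+    "> * dnn_activation: the activation type of the fully connected network\n",
+    "> * att_hidden_size: the number of layers of the fully connected network inside the attention layer and the number of units in each layer\n",
+    "> * att_activation: the activation type of the attention layer\n",
+    "> * att_weight_normalization: whether to normalize the attention scores\n",
+    "> * l2_reg_dnn: the L2 regularization coefficient of the fully connected network\n",
+    "> * l2_reg_embedding: the L2 regularization coefficient of the embedding vectors\n",
+    "> * dnn_dropout: the dropout probability of the units in the fully connected network\n",
+    "> * task: the task type, either classification or regression\n",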
+ "\n",
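+    "In practice we must pass in the feature columns and the historical behavior columns, but before passing them in we need to preprocess the feature columns as follows:\n",
+    "\n",
+    "1. First we process the dataset to obtain the data. Since we predict whether the user will click the current article based on the user's past behavior, we split the feature columns into three parts: numeric features, categorical features, and historical behavior features. DIN handles each part differently.\n",
+    "    1. Categorical features are the category-type features in our dataset, such as user_id. We first apply an embedding to obtain a low-dimensional dense representation of each feature. Because of the embedding, we need to build a vocabulary over the values of every categorical column and specify the embedding dimension. So when preparing data for deepctr's DIN model, we declare these categorical features with SparseFeat, whose arguments are the column name, the number of unique values of the column (used to build the vocabulary), and the embedding dimension.\n",
+    "    2. The user's historical behavior features, such as article id and article category, are likewise embedded first. The difference is that after obtaining the embedding of each feature, we additionally pass them through an attention layer that computes the relevance between the user's historical behaviors and the current candidate article, which yields the embedding vector of the current user. This vector reflects the user's interest according to how similar the current candidate article is to the articles the user clicked in the past, and it changes with the user's click history, dynamically modeling how the user's interest evolves. For every user this kind of feature is a historical behavior sequence, and sequence lengths differ between users: some users have clicked many articles, others only a few. The lengths therefore have to be unified. When preparing data for the DIN model, we first declare these categorical features with SparseFeat, then wrap them with VarLenSparseFeat and pad the sequences so that every user's history has the same length; this is why VarLenSparseFeat takes a maxlen argument that specifies the maximum sequence length.\n",
+    "    3. For continuous feature columns, we only need DenseFeat to specify the column name and the dimension.\n",
+    "2. After the feature columns are processed, we match the corresponding data to the columns, which gives the final data.\n",
+    "\n",
+    "Let's get a feel for this through the concrete code. The logic is as follows: first we write a data preparation function that prepares the data according to the steps above and returns the data and the feature columns; then we build and train the DIN model; finally we test with the model."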
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 25,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:08.405211Z",
+ "start_time": "2020-11-18T04:26:04.887013Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Import deepctr\n",
+ "from deepctr.models import DIN\n",
+ "from deepctr.feature_column import SparseFeat, VarLenSparseFeat, DenseFeat, get_feature_names\n",
+ "from tensorflow.keras.preprocessing.sequence import pad_sequences\n",
+ "\n",
+ "from tensorflow.keras import backend as K\n",
+ "from tensorflow.keras.layers import *\n",
+ "from tensorflow.keras.models import *\n",
+ "from tensorflow.keras.callbacks import * \n",
+ "import tensorflow as tf\n",
+ "\n",
+ "import os\n",
+ "os.environ[\"CUDA_DEVICE_ORDER\"] = \"PCI_BUS_ID\"\n",
+ "os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"2\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 26,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:13.485712Z",
+ "start_time": "2020-11-18T04:26:13.476042Z"
+ }
+ },
+ "outputs": [],
+ "source": [
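+    "# Data preparation function\n",
+    "def get_din_feats_columns(df, dense_fea, sparse_fea, behavior_fea, his_behavior_fea, emb_dim=32, max_len=100):\n",
+    "    \"\"\"\n",
+    "    Data preparation function:\n",
+    "        df: the dataset\n",
+    "        dense_fea: numeric feature columns\n",
+    "        sparse_fea: categorical feature columns\n",
+    "        behavior_fea: the user's candidate behavior feature columns\n",
+    "        his_behavior_fea: the user's historical behavior feature columns\n",
+    "        emb_dim: embedding dimension; for simplicity all categorical feature columns share the same latent dimension\n",
+    "        max_len: maximum length of a user's behavior sequence\n",
+    "    \"\"\"\n",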
+ " \n",
+ " sparse_feature_columns = [SparseFeat(feat, vocabulary_size=df[feat].nunique() + 1, embedding_dim=emb_dim) for feat in sparse_fea]\n",
+ " \n",
+ " dense_feature_columns = [DenseFeat(feat, 1, ) for feat in dense_fea]\n",
+ " \n",
+ " var_feature_columns = [VarLenSparseFeat(SparseFeat(feat, vocabulary_size=df['click_article_id'].nunique() + 1,\n",
+    "                                    embedding_dim=emb_dim, embedding_name='click_article_id'), maxlen=max_len) for feat in his_behavior_fea]\n",
+ " \n",
+ " dnn_feature_columns = sparse_feature_columns + dense_feature_columns + var_feature_columns\n",
+ " \n",
+    "    # Build x as a dict that maps each feature name to its input array\n",
+    "    x = {}\n",
+    "    for name in get_feature_names(dnn_feature_columns):\n",
+    "        if name in his_behavior_fea:\n",
+    "            # Historical behavior sequence: pad or truncate every list to max_len\n",
+    "            his_list = list(df[name])\n",
+    "            x[name] = pad_sequences(his_list, maxlen=max_len, padding='post')  # 2-D array\n",
+ " else:\n",
+ " x[name] = df[name].values\n",
+ " \n",
+ " return x, dnn_feature_columns"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 27,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:18.783217Z",
+ "start_time": "2020-11-18T04:26:18.776795Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Split the features by type\n",
+ "sparse_fea = ['user_id', 'click_article_id', 'category_id', 'click_environment', 'click_deviceGroup', \n",
+ " 'click_os', 'click_country', 'click_region', 'click_referrer_type', 'is_cat_hab']\n",
+ "\n",
+ "behavior_fea = ['click_article_id']\n",
+ "\n",
+ "hist_behavior_fea = ['hist_click_article_id']\n",
+ "\n",
+ "dense_fea = ['sim0', 'time_diff0', 'word_diff0', 'sim_max', 'sim_min', 'sim_sum', 'sim_mean', 'score',\n",
+ " 'rank','click_size','time_diff_mean','active_level','user_time_hob1','user_time_hob2',\n",
+ " 'words_hbo','words_count']"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 28,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:25.469810Z",
+ "start_time": "2020-11-18T04:26:24.779347Z"
+ }
+ },
+ "outputs": [],
+ "source": [
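+    "# Min-max normalize the dense features; neural networks generally need normalized numeric inputs\n",
+    "mm = MinMaxScaler()\n",
+    "\n",
+    "# Special handling: invalid values such as inf produced elsewhere make the normalization fail.\n",
+    "# Keep the two lines below commented out at first; if the code further down raises an error,\n",
+    "# first work out how to avoid producing inf-like values upstream.\n",
+    "# trn_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",
+    "# tst_user_item_feats_df_din_model.replace([np.inf, -np.inf], 0, inplace=True)\n",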
+ "\n",
+    "for feat in dense_fea:\n",
+    "    # Fit the scaler on the training set only, then reuse it so that validation and test share the same scale\n",
+    "    trn_user_item_feats_df_din_model[feat] = mm.fit_transform(trn_user_item_feats_df_din_model[[feat]])\n",
+    "    \n",
+    "    if val_user_item_feats_df_din_model is not None:\n",
+    "        val_user_item_feats_df_din_model[feat] = mm.transform(val_user_item_feats_df_din_model[[feat]])\n",
+    "    \n",
+    "    tst_user_item_feats_df_din_model[feat] = mm.transform(tst_user_item_feats_df_din_model[[feat]])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 29,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:36.727753Z",
+ "start_time": "2020-11-18T04:26:28.854705Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/initializers.py:143: calling RandomNormal.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Prepare the training data\n",
+ "x_trn, dnn_feature_columns = get_din_feats_columns(trn_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ "y_trn = trn_user_item_feats_df_din_model['label'].values\n",
+ "\n",
+ "if offline:\n",
+    "    # Prepare the validation data\n",
+ " x_val, dnn_feature_columns = get_din_feats_columns(val_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_val = val_user_item_feats_df_din_model['label'].values\n",
+ " \n",
+ "dense_fea = [x for x in dense_fea if x != 'label']\n",
+ "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 30,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:26:45.146318Z",
+ "start_time": "2020-11-18T04:26:40.423914Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1288: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Call initializer instance with the dtype argument instead of passing it to the constructor\n",
+ "WARNING:tensorflow:From /home/ryluo/anaconda3/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py:255: add_dispatch_support..wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
+ "Instructions for updating:\n",
+ "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
+ "Model: \"model\"\n",
+ "__________________________________________________________________________________________________\n",
+ "Layer (type) Output Shape Param # Connected to \n",
+ "==================================================================================================\n",
+ "user_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_article_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "category_id (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_environment (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_deviceGroup (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_os (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_country (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_region (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_referrer_type (InputLayer [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "is_cat_hab (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_user_id (Embedding) (None, 1, 32) 1600032 user_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_seq_emb_hist_click_artic multiple 525664 click_article_id[0][0] \n",
+ " hist_click_article_id[0][0] \n",
+ " click_article_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_category_id (Embeddi (None, 1, 32) 7776 category_id[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_environment (E (None, 1, 32) 128 click_environment[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_deviceGroup (E (None, 1, 32) 160 click_deviceGroup[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_os (Embedding) (None, 1, 32) 288 click_os[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_country (Embed (None, 1, 32) 384 click_country[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_region (Embedd (None, 1, 32) 928 click_region[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_click_referrer_type (None, 1, 32) 256 click_referrer_type[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "sparse_emb_is_cat_hab (Embeddin (None, 1, 32) 64 is_cat_hab[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask (NoMask) (None, 1, 32) 0 sparse_emb_user_id[0][0] \n",
+ " sparse_seq_emb_hist_click_article\n",
+ " sparse_emb_category_id[0][0] \n",
+ " sparse_emb_click_environment[0][0\n",
+ " sparse_emb_click_deviceGroup[0][0\n",
+ " sparse_emb_click_os[0][0] \n",
+ " sparse_emb_click_country[0][0] \n",
+ " sparse_emb_click_region[0][0] \n",
+ " sparse_emb_click_referrer_type[0]\n",
+ " sparse_emb_is_cat_hab[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "hist_click_article_id (InputLay [(None, 50)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate (Concatenate) (None, 1, 320) 0 no_mask[0][0] \n",
+ " no_mask[1][0] \n",
+ " no_mask[2][0] \n",
+ " no_mask[3][0] \n",
+ " no_mask[4][0] \n",
+ " no_mask[5][0] \n",
+ " no_mask[6][0] \n",
+ " no_mask[7][0] \n",
+ " no_mask[8][0] \n",
+ " no_mask[9][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_1 (NoMask) (None, 1, 320) 0 concatenate[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "attention_sequence_pooling_laye (None, 1, 32) 13961 sparse_seq_emb_hist_click_article\n",
+ " sparse_seq_emb_hist_click_article\n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_1 (Concatenate) (None, 1, 352) 0 no_mask_1[0][0] \n",
+ " attention_sequence_pooling_layer[\n",
+ "__________________________________________________________________________________________________\n",
+ "sim0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "time_diff0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "word_diff0 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_max (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_min (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_sum (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "sim_mean (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "score (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "rank (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "click_size (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "time_diff_mean (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "active_level (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "user_time_hob1 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "user_time_hob2 (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "words_hbo (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "words_count (InputLayer) [(None, 1)] 0 \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten (Flatten) (None, 352) 0 concatenate_1[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_3 (NoMask) (None, 1) 0 sim0[0][0] \n",
+ " time_diff0[0][0] \n",
+ " word_diff0[0][0] \n",
+ " sim_max[0][0] \n",
+ " sim_min[0][0] \n",
+ " sim_sum[0][0] \n",
+ " sim_mean[0][0] \n",
+ " score[0][0] \n",
+ " rank[0][0] \n",
+ " click_size[0][0] \n",
+ " time_diff_mean[0][0] \n",
+ " active_level[0][0] \n",
+ " user_time_hob1[0][0] \n",
+ " user_time_hob2[0][0] \n",
+ " words_hbo[0][0] \n",
+ " words_count[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_2 (NoMask) (None, 352) 0 flatten[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_2 (Concatenate) (None, 16) 0 no_mask_3[0][0] \n",
+ " no_mask_3[1][0] \n",
+ " no_mask_3[2][0] \n",
+ " no_mask_3[3][0] \n",
+ " no_mask_3[4][0] \n",
+ " no_mask_3[5][0] \n",
+ " no_mask_3[6][0] \n",
+ " no_mask_3[7][0] \n",
+ " no_mask_3[8][0] \n",
+ " no_mask_3[9][0] \n",
+ " no_mask_3[10][0] \n",
+ " no_mask_3[11][0] \n",
+ " no_mask_3[12][0] \n",
+ " no_mask_3[13][0] \n",
+ " no_mask_3[14][0] \n",
+ " no_mask_3[15][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten_1 (Flatten) (None, 352) 0 no_mask_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "flatten_2 (Flatten) (None, 16) 0 concatenate_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "no_mask_4 (NoMask) multiple 0 flatten_1[0][0] \n",
+ " flatten_2[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "concatenate_3 (Concatenate) (None, 368) 0 no_mask_4[0][0] \n",
+ " no_mask_4[1][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "dnn_1 (DNN) (None, 80) 89880 concatenate_3[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "dense (Dense) (None, 1) 80 dnn_1[0][0] \n",
+ "__________________________________________________________________________________________________\n",
+ "prediction_layer (PredictionLay (None, 1) 1 dense[0][0] \n",
+ "==================================================================================================\n",
+ "Total params: 2,239,602\n",
+ "Trainable params: 2,239,362\n",
+ "Non-trainable params: 240\n",
+ "__________________________________________________________________________________________________\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Build the model\n",
+    "model = DIN(dnn_feature_columns, behavior_fea)\n",
+    "\n",
+    "# Inspect the model structure\n",
+    "model.summary()\n",
+    "\n",
+    "# Compile the model\n",
+    "model.compile('adam', 'binary_crossentropy',metrics=['binary_crossentropy', tf.keras.metrics.AUC()])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 31,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:28:43.885773Z",
+ "start_time": "2020-11-18T04:26:48.746787Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Epoch 1/2\n",
+ "290964/290964 [==============================] - 55s 189us/sample - loss: 0.4209 - binary_crossentropy: 0.4206 - auc: 0.7842\n",
+ "Epoch 2/2\n",
+ "290964/290964 [==============================] - 52s 178us/sample - loss: 0.3630 - binary_crossentropy: 0.3618 - auc: 0.8478\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Train the model\n",
+    "if offline:\n",
+    "    history = model.fit(x_trn, y_trn, verbose=1, epochs=10, validation_data=(x_val, y_val) , batch_size=256)\n",
+    "else:\n",
+    "    # You can also use the statement above with a validation set you sampled yourself\n",
+    "    # history = model.fit(x_trn, y_trn, verbose=1, epochs=3, validation_split=0.3, batch_size=256)\n",
+    "    history = model.fit(x_trn, y_trn, verbose=1, epochs=2, batch_size=256)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 32,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:29:20.436591Z",
+ "start_time": "2020-11-18T04:28:58.102057Z"
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "500000/500000 [==============================] - 20s 39us/sample\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Predict with the model\n",
+ "tst_user_item_feats_df_din_model['pred_score'] = model.predict(x_tst, verbose=1, batch_size=256)\n",
+ "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']].to_csv(save_path + 'din_rank_score.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 33,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:29:34.985535Z",
+ "start_time": "2020-11-18T04:29:26.264531Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score']]\n",
+ "submit(rank_results, topk=5, model_name='din')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-15T06:15:49.490705Z",
+ "start_time": "2020-11-15T06:15:49.473794Z"
+ }
+ },
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 34,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:38:53.760383Z",
+ "start_time": "2020-11-18T04:29:51.737721Z"
+ },
+ "scrolled": true
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Train on 232681 samples, validate on 58283 samples\n",
+ "Epoch 1/2\n",
+ "232681/232681 [==============================] - 44s 189us/sample - loss: 0.2864 - binary_crossentropy: 0.2846 - auc: 0.9008 - val_loss: 0.2830 - val_binary_crossentropy: 0.2813 - val_auc: 0.9072\n",
+ "Epoch 2/2\n",
+ "232681/232681 [==============================] - 44s 187us/sample - loss: 0.2832 - binary_crossentropy: 0.2816 - auc: 0.9034 - val_loss: 0.2846 - val_binary_crossentropy: 0.2830 - val_auc: 0.9053\n",
+ "58283/58283 [==============================] - 2s 36us/sample\n",
+ "500000/500000 [==============================] - 19s 37us/sample\n",
+ "Train on 232798 samples, validate on 58166 samples\n",
+ "Epoch 1/2\n",
+ "232798/232798 [==============================] - 43s 184us/sample - loss: 0.2818 - binary_crossentropy: 0.2802 - auc: 0.9051 - val_loss: 0.2968 - val_binary_crossentropy: 0.2953 - val_auc: 0.9062\n",
+ "Epoch 2/2\n",
+ "232798/232798 [==============================] - 44s 187us/sample - loss: 0.2796 - binary_crossentropy: 0.2782 - auc: 0.9069 - val_loss: 0.2820 - val_binary_crossentropy: 0.2806 - val_auc: 0.9071\n",
+ "58166/58166 [==============================] - 2s 38us/sample\n",
+ "500000/500000 [==============================] - 18s 37us/sample\n",
+ "Train on 232847 samples, validate on 58117 samples\n",
+ "Epoch 1/2\n",
+ "232847/232847 [==============================] - 43s 185us/sample - loss: 0.2786 - binary_crossentropy: 0.2773 - auc: 0.9080 - val_loss: 0.2761 - val_binary_crossentropy: 0.2749 - val_auc: 0.9113\n",
+ "Epoch 2/2\n",
+ "232847/232847 [==============================] - 39s 166us/sample - loss: 0.2766 - binary_crossentropy: 0.2754 - auc: 0.9097 - val_loss: 0.2872 - val_binary_crossentropy: 0.2862 - val_auc: 0.9090\n",
+ "58117/58117 [==============================] - 2s 34us/sample\n",
+ "500000/500000 [==============================] - 17s 33us/sample\n",
+ "Train on 232716 samples, validate on 58248 samples\n",
+ "Epoch 1/2\n",
+ "232716/232716 [==============================] - 39s 169us/sample - loss: 0.2763 - binary_crossentropy: 0.2753 - auc: 0.9100 - val_loss: 0.2739 - val_binary_crossentropy: 0.2730 - val_auc: 0.9116\n",
+ "Epoch 2/2\n",
+ "232716/232716 [==============================] - 39s 168us/sample - loss: 0.2743 - binary_crossentropy: 0.2735 - auc: 0.9119 - val_loss: 0.2859 - val_binary_crossentropy: 0.2851 - val_auc: 0.9090\n",
+ "58248/58248 [==============================] - 2s 35us/sample\n",
+ "500000/500000 [==============================] - 17s 34us/sample\n",
+ "Train on 232814 samples, validate on 58150 samples\n",
+ "Epoch 1/2\n",
+ "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2747 - binary_crossentropy: 0.2739 - auc: 0.9115 - val_loss: 0.2702 - val_binary_crossentropy: 0.2695 - val_auc: 0.9163\n",
+ "Epoch 2/2\n",
+ "232814/232814 [==============================] - 40s 170us/sample - loss: 0.2725 - binary_crossentropy: 0.2719 - auc: 0.9132 - val_loss: 0.2751 - val_binary_crossentropy: 0.2745 - val_auc: 0.9151\n",
+ "58150/58150 [==============================] - 2s 34us/sample\n",
+ "500000/500000 [==============================] - 17s 34us/sample\n"
+ ]
+ }
+ ],
+ "source": [
+    "# Five-fold cross-validation; the folds are split by user\n",
+    "# This part is separate from the single training and validation run above\n",
+ "def get_kfold_users(trn_df, n=5):\n",
+ " user_ids = trn_df['user_id'].unique()\n",
+ " user_set = [user_ids[i::n] for i in range(n)]\n",
+ " return user_set\n",
+ "\n",
+ "k_fold = 5\n",
+ "trn_df = trn_user_item_feats_df_din_model\n",
+ "user_set = get_kfold_users(trn_df, n=k_fold)\n",
+ "\n",
+ "score_list = []\n",
+ "score_df = trn_df[['user_id', 'click_article_id', 'label']]\n",
+ "sub_preds = np.zeros(tst_user_item_feats_df_rank_model.shape[0])\n",
+ "\n",
+ "dense_fea = [x for x in dense_fea if x != 'label']\n",
+ "x_tst, dnn_feature_columns = get_din_feats_columns(tst_user_item_feats_df_din_model, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ "\n",
+    "# Five-fold cross-validation; save the intermediate results for stacking\n",
+ "for n_fold, valid_user in enumerate(user_set):\n",
+ " train_idx = trn_df[~trn_df['user_id'].isin(valid_user)] # add slide user\n",
+ " valid_idx = trn_df[trn_df['user_id'].isin(valid_user)]\n",
+ " \n",
+    "    # Prepare the training data\n",
+ " x_trn, dnn_feature_columns = get_din_feats_columns(train_idx, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_trn = train_idx['label'].values\n",
+ "\n",
+    "    # Prepare the validation data\n",
+ " x_val, dnn_feature_columns = get_din_feats_columns(valid_idx, dense_fea, \n",
+ " sparse_fea, behavior_fea, hist_behavior_fea, max_len=50)\n",
+ " y_val = valid_idx['label'].values\n",
+ " \n",
+ " history = model.fit(x_trn, y_trn, verbose=1, epochs=2, validation_data=(x_val, y_val) , batch_size=256)\n",
+ " \n",
+    "    # Predict on the validation set\n",
+ " valid_idx['pred_score'] = model.predict(x_val, verbose=1, batch_size=256) \n",
+ " \n",
+ " valid_idx.sort_values(by=['user_id', 'pred_score'])\n",
+ " valid_idx['pred_rank'] = valid_idx.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ " \n",
+    "    # Collect the validation predictions in a list; they are concatenated later\n",
+ " score_list.append(valid_idx[['user_id', 'click_article_id', 'pred_score', 'pred_rank']])\n",
+ " \n",
+    "    # For online testing, accumulate the predictions from every fold and average them at the end\n",
+ " if not offline:\n",
+ " sub_preds += model.predict(x_tst, verbose=1, batch_size=256)[:, 0] \n",
+ " \n",
+ "score_df_ = pd.concat(score_list, axis=0)\n",
+ "score_df = score_df.merge(score_df_, how='left', on=['user_id', 'click_article_id'])\n",
+    "# Save the new features produced by cross-validation on the training set\n",
+ "score_df[['user_id', 'click_article_id', 'pred_score', 'pred_rank', 'label']].to_csv(save_path + 'trn_din_cls_feats.csv', index=False)\n",
+ " \n",
+    "# Test-set predictions: average over the folds, then save the predicted score and its rank as features for later stacking; more features could be built here as well\n",
+ "tst_user_item_feats_df_din_model['pred_score'] = sub_preds / k_fold\n",
+ "tst_user_item_feats_df_din_model['pred_score'] = tst_user_item_feats_df_din_model['pred_score'].transform(lambda x: norm_sim(x))\n",
+ "tst_user_item_feats_df_din_model.sort_values(by=['user_id', 'pred_score'])\n",
+ "tst_user_item_feats_df_din_model['pred_rank'] = tst_user_item_feats_df_din_model.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')\n",
+ "\n",
+    "# Save the new cross-validation features for the test set\n",
+ "tst_user_item_feats_df_din_model[['user_id', 'click_article_id', 'pred_score', 'pred_rank']].to_csv(save_path + 'tst_din_cls_feats.csv', index=False)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "# Model Fusion"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Weighted Fusion"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 35,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:27.351996Z",
+ "start_time": "2020-11-18T04:44:26.561275Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Read the ranking result files of the models\n",
+ "lgb_ranker = pd.read_csv(save_path + 'lgb_ranker_score.csv')\n",
+ "lgb_cls = pd.read_csv(save_path + 'lgb_cls_score.csv')\n",
+ "din_ranker = pd.read_csv(save_path + 'din_rank_score.csv')\n",
+ "\n",
+    "# The cross-validated test outputs could also be used here for the weighted fusion"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 36,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:31.593981Z",
+ "start_time": "2020-11-18T04:44:31.589439Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "rank_model = {'lgb_ranker': lgb_ranker, \n",
+ " 'lgb_cls': lgb_cls, \n",
+ " 'din_ranker': din_ranker}"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:36.135860Z",
+ "start_time": "2020-11-18T04:44:36.130577Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "def get_ensumble_predict_topk(rank_model, topk=5):\n",
+    "    # Rescale the LGB ranker scores with norm_sim, then stack the predictions of all three models\n",
+    "    rank_model['lgb_ranker']['pred_score'] = rank_model['lgb_ranker']['pred_score'].transform(lambda x: norm_sim(x))\n",
+    "    \n",
+    "    final_recall = pd.concat([rank_model['lgb_cls'], rank_model['din_ranker'], rank_model['lgb_ranker']])\n",
+    "    final_recall = final_recall.groupby(['user_id', 'click_article_id'])['pred_score'].sum().reset_index()\n",
+    "    \n",
+    "    submit(final_recall, topk=topk, model_name='ensemble_fuse')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:51.659270Z",
+ "start_time": "2020-11-18T04:44:40.445659Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "get_ensumble_predict_topk(rank_model)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "## Stacking"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:44:58.025992Z",
+ "start_time": "2020-11-18T04:44:56.146962Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Read the cross-validation result files produced by each model\n",
+    "# Training set\n",
+ "trn_lgb_ranker_feats = pd.read_csv(save_path + 'trn_lgb_ranker_feats.csv')\n",
+ "trn_lgb_cls_feats = pd.read_csv(save_path + 'trn_lgb_cls_feats.csv')\n",
+ "trn_din_cls_feats = pd.read_csv(save_path + 'trn_din_cls_feats.csv')\n",
+ "\n",
+    "# Test set\n",
+ "tst_lgb_ranker_feats = pd.read_csv(save_path + 'tst_lgb_ranker_feats.csv')\n",
+ "tst_lgb_cls_feats = pd.read_csv(save_path + 'tst_lgb_cls_feats.csv')\n",
+ "tst_din_cls_feats = pd.read_csv(save_path + 'tst_din_cls_feats.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 40,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:07.701862Z",
+ "start_time": "2020-11-18T04:45:07.644335Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Concatenate the features output by the models\n",
+ "\n",
+ "finall_trn_ranker_feats = trn_lgb_ranker_feats[['user_id', 'click_article_id', 'label']]\n",
+ "finall_tst_ranker_feats = tst_lgb_ranker_feats[['user_id', 'click_article_id']]\n",
+ "\n",
+ "for idx, trn_model in enumerate([trn_lgb_ranker_feats, trn_lgb_cls_feats, trn_din_cls_feats]):\n",
+ " for feat in [ 'pred_score', 'pred_rank']:\n",
+ " col_name = feat + '_' + str(idx)\n",
+ " finall_trn_ranker_feats[col_name] = trn_model[feat]\n",
+ "\n",
+ "for idx, tst_model in enumerate([tst_lgb_ranker_feats, tst_lgb_cls_feats, tst_din_cls_feats]):\n",
+ " for feat in [ 'pred_score', 'pred_rank']:\n",
+ " col_name = feat + '_' + str(idx)\n",
+ " finall_tst_ranker_feats[col_name] = tst_model[feat]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 41,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:15.044242Z",
+ "start_time": "2020-11-18T04:45:13.138252Z"
+ }
+ },
+ "outputs": [],
+ "source": [
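+    "# Define a logistic regression model that fits the cross-validation features and predicts on the test set\n",
+    "# Note: during cross-validation you can build more features related to the predicted values to enrich the inputs of this simple model\n",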
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "feat_cols = ['pred_score_0', 'pred_rank_0', 'pred_score_1', 'pred_rank_1', 'pred_score_2', 'pred_rank_2']\n",
+ "\n",
+ "trn_x = finall_trn_ranker_feats[feat_cols]\n",
+ "trn_y = finall_trn_ranker_feats['label']\n",
+ "\n",
+ "tst_x = finall_tst_ranker_feats[feat_cols]\n",
+ "\n",
+    "# Define the model\n",
+ "lr = LogisticRegression()\n",
+ "\n",
+    "# Train the model\n",
+ "lr.fit(trn_x, trn_y)\n",
+ "\n",
+    "# Predict\n",
+ "finall_tst_ranker_feats['pred_score'] = lr.predict_proba(tst_x)[:, 1]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 42,
+ "metadata": {
+ "ExecuteTime": {
+ "end_time": "2020-11-18T04:45:29.018764Z",
+ "start_time": "2020-11-18T04:45:19.423130Z"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "# Re-rank the predictions and generate the submission file\n",
+ "rank_results = finall_tst_ranker_feats[['user_id', 'click_article_id', 'pred_score']]\n",
+ "submit(rank_results, topk=5, model_name='ensumble_staking')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
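+    "# Summary\n",
+    "This chapter covered three ranking models: the LGB ranker, the LGB classifier, and the deep-learning DIN model. We did not explain the principles behind these models in detail; please explore them on your own after class, and feel free to share what you find so that we can learn and improve together. Finally, we applied two simple model-fusion strategies: simple weighting and stacking.\n",
+    "\n",
+    "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment and make mistakes, and the courage to take responsibility. Datawhale also applies open-source principles to explore open-source content, open-source learning, and open-source solutions, empowering talent development, supporting people's growth, and building connections between people, between people and knowledge, between people and companies, and between people and the future. For this data-mining learning path, the topic materials will be shared on Tianchi; follow Datawhale for details:\n",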
+ "\n",
+ "![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
+ ]
}
- },
- "outputs": [],
- "source": [
-    "# Re-rank the predictions and generate the submission file\n",
- "rank_results = finall_tst_ranker_feats[['user_id', 'click_article_id', 'pred_score']]\n",
- "submit(rank_results, topk=5, model_name='ensumble_staking')"
- ]
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
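-    "# Summary\n",
-    "This chapter covered three ranking models: the LGB ranker, the LGB classifier, and the deep-learning DIN model. We did not explain the principles behind these models in detail; please explore them on your own after class, and feel free to share what you find so that we can learn and improve together. Finally, we applied two simple model-fusion strategies: simple weighting and stacking.\n",
-    "\n",
-    "About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment and make mistakes, and the courage to take responsibility. Datawhale also applies open-source principles to explore open-source content, open-source learning, and open-source solutions, empowering talent development, supporting people's growth, and building connections between people, between people and knowledge, between people and companies, and between people and the future. For this data-mining learning path, the topic materials will be shared on Tianchi; follow Datawhale for details:\n",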
- "\n",
- "![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)"
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.6.8"
- },
- "latex_envs": {
- "LaTeX_envs_menu_present": true,
- "autoclose": false,
- "autocomplete": true,
- "bibliofile": "biblio.bib",
- "cite_by": "apalike",
- "current_citInitial": 1,
- "eqLabelWithNumbers": true,
- "eqNumInitial": 1,
- "hotkeys": {
- "equation": "Ctrl-E",
- "itemize": "Ctrl-I"
- },
- "labels_anchors": false,
- "latex_user_defs": false,
- "report_style_numbering": false,
- "user_envs_cfg": false
- },
- "toc": {
- "base_numbering": 1,
- "nav_menu": {},
- "number_sections": true,
- "sideBar": true,
- "skip_h1_title": false,
- "title_cell": "Table of Contents",
- "title_sidebar": "Contents",
- "toc_cell": false,
- "toc_position": {
- "height": "calc(100% - 180px)",
- "left": "10px",
- "top": "150px",
- "width": "170px"
- },
- "toc_section_display": true,
- "toc_window_display": true
- },
- "varInspector": {
- "cols": {
- "lenName": 16,
- "lenType": 16,
- "lenVar": 40
- },
- "kernels_config": {
- "python": {
- "delete_cmd_postfix": "",
- "delete_cmd_prefix": "del ",
- "library": "var_list.py",
- "varRefreshCmd": "print(var_dic_list())"
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.6.8"
},
- "r": {
- "delete_cmd_postfix": ") ",
- "delete_cmd_prefix": "rm(",
- "library": "var_list.r",
- "varRefreshCmd": "cat(var_dic_list()) "
+ "latex_envs": {
+ "LaTeX_envs_menu_present": true,
+ "autoclose": false,
+ "autocomplete": true,
+ "bibliofile": "biblio.bib",
+ "cite_by": "apalike",
+ "current_citInitial": 1,
+ "eqLabelWithNumbers": true,
+ "eqNumInitial": 1,
+ "hotkeys": {
+ "equation": "Ctrl-E",
+ "itemize": "Ctrl-I"
+ },
+ "labels_anchors": false,
+ "latex_user_defs": false,
+ "report_style_numbering": false,
+ "user_envs_cfg": false
+ },
+ "toc": {
+ "base_numbering": 1,
+ "nav_menu": {},
+ "number_sections": true,
+ "sideBar": true,
+ "skip_h1_title": false,
+ "title_cell": "Table of Contents",
+ "title_sidebar": "Contents",
+ "toc_cell": false,
+ "toc_position": {
+ "height": "calc(100% - 180px)",
+ "left": "10px",
+ "top": "150px",
+ "width": "170px"
+ },
+ "toc_section_display": true,
+ "toc_window_display": true
+ },
+ "varInspector": {
+ "cols": {
+ "lenName": 16,
+ "lenType": 16,
+ "lenVar": 40
+ },
+ "kernels_config": {
+ "python": {
+ "delete_cmd_postfix": "",
+ "delete_cmd_prefix": "del ",
+ "library": "var_list.py",
+ "varRefreshCmd": "print(var_dic_list())"
+ },
+ "r": {
+ "delete_cmd_postfix": ") ",
+ "delete_cmd_prefix": "rm(",
+ "library": "var_list.r",
+ "varRefreshCmd": "cat(var_dic_list()) "
+ }
+ },
+ "types_to_exclude": [
+ "module",
+ "function",
+ "builtin_function_or_method",
+ "instance",
+ "_Feature"
+ ],
+ "window_display": false
}
- },
- "types_to_exclude": [
- "module",
- "function",
- "builtin_function_or_method",
- "instance",
- "_Feature"
- ],
- "window_display": false
- }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.md"
index 645152157..5c3930fe0 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.1 \350\265\233\351\242\230\347\220\206\350\247\243+Baseline.md"
@@ -377,7 +377,7 @@ submit(tst_recall, topk=5, model_name='itemcf_baseline')
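 **About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, along with team members who share an open-source and exploratory spirit. With the vision “for the learner, growing together with learners”, Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and help, the willingness to experiment and make mistakes, and the courage to take responsibility. Datawhale also applies open-source principles to explore open-source content, open-source learning, and open-source solutions, empowering talent development, supporting people's growth, and building connections between people, between people and knowledge, between people and companies, and between people and the future. For this data-mining learning path, the topic materials will be shared on Tianchi; follow Datawhale for details: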
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.md"
index 173d95002..5584973fa 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.2 \346\225\260\346\215\256\345\210\206\346\236\220.md"
@@ -66,7 +66,7 @@ trn_click = trn_click.merge(item_df, how='left', on=['click_article_id'])
trn_click.head()
```
-![image-20201119112706647](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112706647.png)
+![image-20201119112706647](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112706647.png)
**train_click_log.csv文件数据中每个字段的含义**
@@ -86,7 +86,7 @@ trn_click.head()
trn_click.info()
```
-![image-20201119112622939](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112622939.png)
+![image-20201119112622939](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112622939.png)
@@ -94,7 +94,7 @@ trn_click.info()
trn_click.describe()
```
-![image-20201119112649376](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112649376.png)
+![image-20201119112649376](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112649376.png)
```python
@@ -133,7 +133,7 @@ plt.tight_layout()
plt.show()
```
-![Insert image description here](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/20201118000820300.png)
+![Insert image description here](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/20201118000820300.png)
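 **The click timestamps (click_timestamp) are distributed fairly evenly, so no special handling is needed. Because the timestamps have 13 digits, they are converted to 10 digits later to simplify calculation.**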
@@ -149,14 +149,14 @@ tst_click = tst_click.merge(item_df, how='left', on=['click_article_id'])
tst_click.head()
```
-![image-20201119112952261](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112952261.png)
+![image-20201119112952261](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112952261.png)
```python
tst_click.describe()
```
-![image-20201119113015529](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113015529.png)
+![image-20201119113015529](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113015529.png)
 **We can see that the users in the training set and the test set are completely different.**
@@ -187,14 +187,14 @@ tst_click.groupby('user_id')['click_article_id'].count().min() # 注意测试集
item_df.head().append(item_df.tail())
```
-![image-20201119113118388](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113118388.png)
+![image-20201119113118388](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113118388.png)
```python
item_df['words_count'].value_counts()
```
-![image-20201119113147240](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113147240.png)
+![image-20201119113147240](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113147240.png)
```python
@@ -219,7 +219,7 @@ item_df.shape # 364047 articles
item_emb_df.head()
```
-![image-20201119113253455](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113253455.png)
+![image-20201119113253455](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113253455.png)
```python
item_emb_df.shape
@@ -245,21 +245,21 @@ user_click_count = user_click_merge.groupby(['user_id', 'click_article_id'])['cl
user_click_count[:10]
```
-![image-20201119113334727](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113334727.png)
+![image-20201119113334727](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113334727.png)
```python
user_click_count[user_click_count['count']>7]
```
-![image-20201119113351807](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113351807.png)
+![image-20201119113351807](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113351807.png)
```python
user_click_count['count'].unique()
```
-![image-20201119113429769](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113429769.png)
+![image-20201119113429769](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113429769.png)
```python
@@ -267,7 +267,7 @@ user_click_count['count'].unique()
user_click_count.loc[:,'count'].value_counts()
```
-![image-20201119113414785](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113414785.png)
+![image-20201119113414785](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113414785.png)
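 **We can see that 1605541 users (about 99.2%) never read an article more than once; only a very small number of users clicked the same article repeatedly. This can also be turned into a separate feature.**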
@@ -301,15 +301,15 @@ for _, user_df in sample_users.groupby('user_id'):
plot_envs(user_df, cols, 2, 3)
```
-![image-20201119113624424](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113624424.png)
+![image-20201119113624424](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113624424.png)
-![image-20201119113637746](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113637746.png)
+![image-20201119113637746](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113637746.png)
-![image-20201119113652132](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113652132.png)
+![image-20201119113652132](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113652132.png)
-![image-20201119113702034](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113702034.png)
+![image-20201119113702034](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113702034.png)
-![image-20201119113714135](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113714135.png)
+![image-20201119113714135](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113714135.png)
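**The vast majority of users click from a fairly fixed environment. Idea: statistical features of these environment fields can be used to represent attributes of the user.**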
@@ -322,7 +322,7 @@ plt.plot(user_click_item_count)
```
-![image-20201119113759490](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113759490.png)
+![image-20201119113759490](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113759490.png)
**A user's activity level can be read from the number of articles they clicked.**
@@ -332,7 +332,7 @@ plt.plot(user_click_item_count)
plt.plot(user_click_item_count[:50])
```
-![image-20201119113825586](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113825586.png)
+![image-20201119113825586](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113825586.png)
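**The top 50 users by click count each have more than 100 clicks. Idea: define users with at least 100 clicks as active users. This is a simple approach; a more complete measure of activity also takes click time into account, and later we judge user activity from both click count and click time.**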
@@ -342,7 +342,7 @@ plt.plot(user_click_item_count[:50])
plt.plot(user_click_item_count[25000:50000])
```
-![image-20201119113844946](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113844946.png)
+![image-20201119113844946](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113844946.png)
**A very large number of users have two or fewer clicks; these users can be treated as inactive.**
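The two rules of thumb above (at least 100 clicks means active, at most two clicks means inactive) can be written as a small labeling function. This is only a sketch: it uses plain Python rather than the pandas objects in this notebook, and the function name, the `(user_id, article_id)` input format, and the middle "normal" label are assumptions made for illustration.

```python
from collections import Counter


def label_user_activity(click_log, active_threshold=100, inactive_threshold=2):
    """Label each user by click count.

    click_log is an iterable of (user_id, article_id) pairs (assumed format).
    """
    counts = Counter(user_id for user_id, _ in click_log)
    labels = {}
    for user_id, n_clicks in counts.items():
        if n_clicks >= active_threshold:
            labels[user_id] = "active"
        elif n_clicks <= inactive_threshold:
            labels[user_id] = "inactive"
        else:
            # Hypothetical middle bucket; the text only names the two extremes.
            labels[user_id] = "normal"
    return labels
```

Both thresholds are parameters because they come from eyeballing the plots and may need tuning on other data.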
@@ -358,14 +358,14 @@ item_click_count = sorted(user_click_merge.groupby('click_article_id')['user_id'
plt.plot(item_click_count)
```
-![image-20201119113912912](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113912912.png)
+![image-20201119113912912](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113912912.png)
```python
plt.plot(item_click_count[:100])
```
-![image-20201119113930745](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113930745.png)
+![image-20201119113930745](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113930745.png)
**The 100 most-clicked news articles each have more than 1000 clicks.**
@@ -374,7 +374,7 @@ plt.plot(item_click_count[:100])
plt.plot(item_click_count[:20])
```
-![image-20201119113958254](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113958254.png)
+![image-20201119113958254](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119113958254.png)
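**The 20 most-clicked news articles each have more than 2500 clicks. Idea: define these as hot news. This is again a simple approach; later we also grade article popularity by click count and time.**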
@@ -383,7 +383,7 @@ plt.plot(item_click_count[:20])
plt.plot(item_click_count[3500:])
```
-![image-20201119114017762](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114017762.png)
+![image-20201119114017762](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114017762.png)
**Many news articles were clicked only once or twice. Idea: define these as cold news.**
@@ -397,7 +397,7 @@ union_item = tmp.groupby(['click_article_id','next_item'])['click_timestamp'].ag
union_item[['count']].describe()
```
-![image-20201119114044351](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114044351.png)
+![image-20201119114044351](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114044351.png)
**The summary statistics show a mean co-occurrence count of 2.88 and a maximum of 1687.**
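As a rough illustration of what the co-occurrence count measures, the sketch below counts how often one article is immediately followed by another in the same user's click sequence. It is a plain-Python stand-in, not the notebook's pandas code; the function name and the `(user_id, article_id, timestamp)` tuple format are assumptions.

```python
from collections import Counter, defaultdict


def count_consecutive_pairs(clicks):
    """Count (article, next_article) pairs over each user's time-ordered clicks.

    clicks is an iterable of (user_id, article_id, timestamp) tuples (assumed format).
    """
    events_by_user = defaultdict(list)
    for user_id, article_id, timestamp in clicks:
        events_by_user[user_id].append((timestamp, article_id))

    pair_counts = Counter()
    for events in events_by_user.values():
        events.sort()  # order each user's clicks by timestamp
        for (_, current), (_, following) in zip(events, events[1:]):
            pair_counts[(current, following)] += 1
    return pair_counts
```

Summary statistics such as the mean and maximum can then be computed over `pair_counts.values()`.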
@@ -411,14 +411,14 @@ y = union_item['count']
plt.scatter(x, y)
```
-![image-20201119114106223](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114106223.png)
+![image-20201119114106223](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114106223.png)
```python
plt.plot(union_item['count'].values[40000:])
```
-![image-20201119114122557](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114122557.png)
+![image-20201119114122557](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114122557.png)
**Roughly 70000 pairs co-occur at least once.**
@@ -432,7 +432,7 @@ plt.plot(union_item['count'].values[40000:])
plt.plot(user_click_merge['category_id'].value_counts().values)
```
-![image-20201119114144058](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114144058.png)
+![image-20201119114144058](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114144058.png)
```python
@@ -440,7 +440,7 @@ plt.plot(user_click_merge['category_id'].value_counts().values)
plt.plot(user_click_merge['category_id'].value_counts().values[150:])
```
-![image-20201119114201764](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114201764.png)
+![image-20201119114201764](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114201764.png)
```python
@@ -455,7 +455,7 @@ user_click_merge['words_count'].describe()
plt.plot(user_click_merge['words_count'].values)
```
-![image-20201119114241194](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114241194.png)
+![image-20201119114241194](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114241194.png)
@@ -469,7 +469,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), re
```
-![image-20201119114300286](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114300286.png)
+![image-20201119114300286](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114300286.png)
**The plot above shows that a small share of users read an extremely wide range of categories, while most users stay below 20 news categories.**
@@ -478,7 +478,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['category_id'].nunique(), re
user_click_merge.groupby('user_id')['category_id'].nunique().reset_index().describe()
```
-![image-20201119114318523](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114318523.png)
+![image-20201119114318523](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114318523.png)
### Distribution of the length of articles users read
@@ -490,7 +490,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
```
-![image-20201119114337448](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114337448.png)
+![image-20201119114337448](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114337448.png)
@@ -504,7 +504,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), reverse=True)[1000:45000])
```
-![image-20201119114355195](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114355195.png)
+![image-20201119114355195](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114355195.png)
**Most users read articles of fewer than 250 words.**
@@ -514,7 +514,7 @@ plt.plot(sorted(user_click_merge.groupby('user_id')['words_count'].mean(), rever
user_click_merge.groupby('user_id')['words_count'].mean().reset_index().describe()
```
-![image-20201119114418911](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114418911.png)
+![image-20201119114418911](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114418911.png)
@@ -536,7 +536,7 @@ user_click_merge = user_click_merge.sort_values('click_timestamp')
user_click_merge.head()
```
-![image-20201119114447904](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114447904.png)
+![image-20201119114447904](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114447904.png)
```python
@@ -558,7 +558,7 @@ mean_diff_click_time = user_click_merge.groupby('user_id')['click_timestamp', 'c
plt.plot(sorted(mean_diff_click_time.values, reverse=True))
```
-![image-20201119114505086](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114505086.png)
+![image-20201119114505086](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119114505086.png)
**The plot above shows that the time gap between clicks differs from user to user.**
@@ -573,7 +573,7 @@ mean_diff_created_time = user_click_merge.groupby('user_id')['click_timestamp',
plt.plot(sorted(mean_diff_created_time.values, reverse=True))
```
-![image-20201119122227666](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122227666.png)
+![image-20201119122227666](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122227666.png)
**The plot shows that the creation times of articles a user clicks consecutively also differ.**
@@ -602,7 +602,7 @@ sub_user_info = user_click_merge[user_click_merge['user_id'].isin(sub_user_ids)]
sub_user_info.head()
```
-![image-20201119122251274](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122251274.png)
+![image-20201119122251274](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122251274.png)
```python
@@ -625,7 +625,7 @@ for _, user_df in sub_user_info.groupby('user_id'):
```
-![image-20201119122310969](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122310969.png)
+![image-20201119122310969](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119122310969.png)
@@ -654,5 +654,5 @@ for _, user_df in sub_user_info.groupby('user_id'):
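**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details: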
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.md"
index 323cf46fe..9bf554093 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.3 \345\244\232\350\267\257\345\217\254\345\233\236.md"
@@ -2,7 +2,7 @@
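The "multi-channel recall" strategy uses different strategies, features, or simple models to each recall part of the candidate set, then merges those candidates for the downstream ranking model. It is clearly a trade-off between computation speed and recall rate: each simple strategy keeps recall fast, while designing the strategies from different angles keeps the recall rate close to ideal so that ranking quality does not suffer. The figure below sketches multi-channel recall. The channels are independent of one another, so they can generally be run concurrently in multiple threads for better efficiency.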
-
+
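The figure above is only one example of multi-channel recall: several different strategies can be used to build the candidate item set for ranking, and which ones to use depends heavily on the business. Each task has recall rules that reflect its real scenario. For video recommendation, for example, the rules might be "hot videos", "recall by director", "recall by actor", "recently released", "trending", "recall by genre", and so on.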
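Because the recall channels are independent, running them in parallel and merging their scores is straightforward. The following is a minimal sketch under stated assumptions, not the project's own merging code: the two channel functions, their hard-coded scores, and the weighted-sum merge rule are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def recall_hot(user_id):
    """Hypothetical channel: globally popular articles with scores."""
    return {"a1": 0.9, "a2": 0.5}


def recall_itemcf(user_id):
    """Hypothetical channel: item-based collaborative filtering."""
    return {"a2": 0.8, "a3": 0.4}


def multi_channel_recall(user_id, channels, weights, topk=3):
    """Run every channel in its own thread, then merge by weighted score sum."""
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        futures = {name: pool.submit(fn, user_id) for name, fn in channels.items()}
        results = {name: future.result() for name, future in futures.items()}

    merged = {}
    for name, scored_items in results.items():
        weight = weights.get(name, 1.0)
        for item_id, score in scored_items.items():
            merged[item_id] = merged.get(item_id, 0.0) + weight * score
    # Highest merged score first; keep only the top-k candidates for ranking.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:topk]
```

An item returned by several channels accumulates score from each of them, so agreement between channels pushes it up the candidate list.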
@@ -1344,4 +1344,4 @@ final_recall_items_dict_rank = combine_recall_results(user_multi_recall_dict, we
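**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details: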
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.md"
index 197765e8b..e5e267f0e 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.4 \347\211\271\345\276\201\345\267\245\347\250\213.md"
@@ -193,7 +193,7 @@ Word2Vec主要思想是:一个词的上下文可以很好的表达出词的语
- skip-gram: predict the surrounding words from the center word.
- cbow: predict the center word from the surrounding words.
-![image-20201106225233086](http://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)
+![image-20201106225233086](https://ryluo.oss-cn-chengdu.aliyuncs.com/Javaimage-20201106225233086.png)
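To make the difference between the two training modes concrete, the sketch below builds the training examples each mode would see for a short token sequence. It only illustrates how the examples are formed; it is not how gensim implements training, and the function names and the `window` parameter are assumptions.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (center, context) pair per surrounding word."""
    pairs = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples
```

In this project the "words" would be article ids in a user's click sequence, so the same construction applies to clicks instead of text.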
When training word2vec with gensim, a few parameters matter most
- size: the dimensionality of the word vectors.
@@ -985,5 +985,5 @@ tst_user_item_feats_df.to_csv(save_path + 'tst_user_item_feats_df.csv', index=Fa
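**About Datawhale:** Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details: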
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.md"
index 9fef3fda5..0e8f45abe 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.1\347\253\236\350\265\233\345\256\236\350\267\265/markdown/2.5 \346\216\222\345\272\217\346\250\241\345\236\213+\346\250\241\345\236\213\350\236\215\345\220\210.md"
@@ -407,7 +407,7 @@ tst_user_item_feats_df_din_model = tst_user_item_feats_df_din_model.merge(his_be
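Next we try the DIN model. DIN stands for Deep Interest Network, a model Alibaba proposed in 2018 because earlier deep learning models could not express a user's diverse interests. It computes a representation vector of user interest by considering the relevance between the [given candidate ad] and the [user's historical behaviors]. Specifically, it introduces a local activation unit that soft-searches the relevant parts of the behavior history to focus on the relevant interests, then uses a weighted sum to obtain the user-interest representation with respect to the candidate ad. Behaviors more relevant to the candidate ad receive higher activation weights and dominate the interest representation. Because this vector differs from ad to ad, the model's expressive power improves considerably. The model therefore also suits this news recommendation task: here we compute the user's interest in an article from the relevance between the current candidate article and the articles the user clicked in the past. The model structure is as follows: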
-![image-20201116201646983](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)
+![image-20201116201646983](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201116201646983.png)
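The weighted-sum idea can be sketched in a few lines. Note the simplification: DIN learns its activation weights with a small network (the local activation unit), whereas this sketch swaps in a plain dot product as the relevance score, purely for illustration. The function name and the list-of-lists embedding format are assumptions.

```python
def din_style_interest(history_embs, candidate_emb):
    """Weight each historical behavior by its relevance to the candidate, then sum.

    A dot product stands in for DIN's learned activation unit (an assumption
    made for this sketch; the real model learns these weights).
    """
    weights = [
        sum(h * c for h, c in zip(hist, candidate_emb))
        for hist in history_embs
    ]
    dim = len(candidate_emb)
    interest = [
        sum(w * hist[d] for w, hist in zip(weights, history_embs))
        for d in range(dim)
    ]
    return interest, weights
```

Because the weights depend on the candidate, the same click history yields a different interest vector for each candidate article.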
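Here we use the model directly from a package; its details will be covered in the next round of the recommender-system team study. The following describes how to use it. The deepctr function prototype is: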
@@ -949,4 +949,4 @@ submit(rank_results, topk=5, model_name='ensumble_staking')
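About Datawhale: Datawhale is an open-source organization focused on data science and AI. It brings together outstanding learners from universities and well-known companies across many fields, and its team members share an open-source and exploratory spirit. With the vision "for the learner, growing together with learners", Datawhale encourages authentic self-expression, openness and inclusiveness, mutual trust and support, willingness to experiment, and the courage to take responsibility. Datawhale also applies open-source ideas to open content, open learning, and open solutions, supporting talent development and building connections between people, between people and knowledge, between people and companies, and between people and the future. The topic material for this data mining learning path will be shared on Tianchi; follow Datawhale for details: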
-![image-20201119112159065](http://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
\ No newline at end of file
+![image-20201119112159065](https://ryluo.oss-cn-chengdu.aliyuncs.com/abc/image-20201119112159065.png)
\ No newline at end of file
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.3 Redis\345\237\272\347\241\200.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.3 Redis\345\237\272\347\241\200.md"
index 2d79c1fbf..9153d9f15 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.3 Redis\345\237\272\347\241\200.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.3 Redis\345\237\272\347\241\200.md"
@@ -20,7 +20,7 @@ sudo apt-get install redis-server
Result after the download completes
-![image-20211030164414594](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164414594.png)
+![image-20211030164414594](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164414594.png)
**Start the Redis service:**
@@ -30,7 +30,7 @@ sudo apt-get install redis-server
service redis-server status
```
-![image-20211030164432589](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164432589.png)
+![image-20211030164432589](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164432589.png)
Check the running processes to see whether Redis has started. (PS: you can see the Redis service listening on port 6379.)
@@ -38,7 +38,7 @@ service redis-server status
ps -aux|grep redis-server
```
-![image-20211030164448713](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164448713.png)
+![image-20211030164448713](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164448713.png)
Alternatively, enter the Redis client and talk to the server: type the ping command, and a PONG reply indicates that Redis was installed successfully.
@@ -46,7 +46,7 @@ ps -aux|grep redis-server
redis-cli
```
-![image-20211030164455928](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164455928.png)
+![image-20211030164455928](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211030164455928.png)
In the output above, 127.0.0.1 is the IP address of the Redis server and 6379 is the port it runs on.
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.4 scrapy\345\237\272\347\241\200\345\217\212\346\226\260\351\227\273\347\210\254\345\217\226\345\256\236\346\210\230.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.4 scrapy\345\237\272\347\241\200\345\217\212\346\226\260\351\227\273\347\210\254\345\217\226\345\256\236\346\210\230.md"
index 8a74c546e..dc29a96f1 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.4 scrapy\345\237\272\347\241\200\345\217\212\346\226\260\351\227\273\347\210\254\345\217\226\345\256\236\346\210\230.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.4 scrapy\345\237\272\347\241\200\345\217\212\346\226\260\351\227\273\347\210\254\345\217\226\345\256\236\346\210\230.md"
@@ -129,7 +129,7 @@ class QuotesSpider(scrapy.Spider):
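Because the news crawling project lives alongside the news recommendation system, and to make it easier to study ahead, the project's directory structure and the code in its key files are given directly below. The final project will be open-sourced together with the news recommendation system.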
-
+
1. **Create a scrapy project:**
@@ -164,7 +164,7 @@ class SinanewsItem(scrapy.Item):
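One thing to note: the news is crawled from a relatively plain listing site, which is much simpler than crawling the latest Sina News official site directly. The plain site's link looks roughly like this: https://news.sina.com.cn/roll/#pageid=153&lid=2509&k=&num=50&page=1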
-
+
```python
# -*- coding: utf-8 -*-
@@ -497,7 +497,7 @@ sh run_scrapy_sina.sh
Finally, check the data in the database:
-
+
### References
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.5 \350\207\252\345\212\250\345\214\226\346\236\204\345\273\272\347\224\250\346\210\267\345\217\212\347\211\251\346\226\231\347\224\273\345\203\217.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.5 \350\207\252\345\212\250\345\214\226\346\236\204\345\273\272\347\224\250\346\210\267\345\217\212\347\211\251\346\226\231\347\224\273\345\203\217.md"
index 4cc60eda6..bd9d70acc 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.5 \350\207\252\345\212\250\345\214\226\346\236\204\345\273\272\347\224\250\346\210\267\345\217\212\347\211\251\346\226\231\347\224\273\345\203\217.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.1.5 \350\207\252\345\212\250\345\214\226\346\236\204\345\273\272\347\224\250\346\210\267\345\217\212\347\211\251\346\226\231\347\224\273\345\203\217.md"
@@ -1,4 +1,4 @@
-![image-20211203145147649](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203145147649.png)
+![image-20211203145147649](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203145147649.png)
# Automated Construction of User and Material Profiles
@@ -19,13 +19,13 @@
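First, the logic for adding new material to the material library. This necessarily happens after the news is crawled, and before new material is added it needs some simple profile processing. The profile fields we currently define are as follows (processed profiles are stored in MongoDB):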
-
+
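Concretely, we iterate over all articles crawled today and deduplicate by using each article's title to check whether it is already in the material library (a news site may show the same article on several days). We then initialize the corresponding profile fields as defined and finally store the result in the profile material pool.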
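The deduplicate-then-initialize step described above might look like the sketch below. It is an in-memory illustration only: a Python list stands in for the MongoDB collection, and the function name and the counter fields (`read_num`, `likes`, `collections`) are hypothetical rather than the project's actual schema.

```python
def add_new_materials(crawled_articles, material_pool):
    """Append unseen articles to material_pool and return the ones added.

    Articles are dicts with at least a "title" key (assumed format).
    """
    seen_titles = {doc["title"] for doc in material_pool}
    added = []
    for article in crawled_articles:
        title = article["title"]
        if title in seen_titles:
            # Already stored on an earlier day, or repeated in today's crawl.
            continue
        # Hypothetical dynamic fields, initialized to zero for a new article.
        profile = dict(article, read_num=0, likes=0, collections=0)
        material_pool.append(profile)
        seen_titles.add(title)
        added.append(profile)
    return added
```

Tracking the titles in a set keeps each membership check constant-time, which matters once the material library grows.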
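To update the profiles of old material, we first need to know which of its fields are changed by user behavior. Below is the news list page: the front end shows each article's read, like, and favorite counts, and user interactions (reading, liking, and favoriting) change these values.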
-
+
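To show this dynamic behavior information on the front end in real time, we store each article's dynamic information in Redis ahead of time, and online requests read the news data directly from Redis. When a user interacts with an article, the dynamic information is updated by writing the new value directly to Redis, so that the front end can always fetch the latest dynamic profile of the article.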
@@ -175,9 +175,9 @@ if __name__ == "__main__":
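That covers updating the material. Next, how the updated material is added to the Redis database. For storing news content in Redis, we split each article's information into two parts: attributes that never change (for example creation time, title, and body) and the material's dynamic attributes. Their Redis keys are static_news_detail:news_id and dynamic_news_detail:news_id respectively. Below is the actual content stored in Redis.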
-
+
-
+
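The purpose is to make real-time online updates of the dynamic information more efficient. To get the full details of an article, both records must be looked up and joined before being sent to the front end for display. The code logic for this part is as follows: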
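The project's own code for this step lies outside this excerpt. As a hypothetical illustration of the static/dynamic split only, the sketch below uses a plain dict in place of Redis and assumes each value is stored as a JSON string; the two key prefixes come from the text, while the function names and field names are made up.

```python
import json


def get_news_detail(store, news_id):
    """Join the static and dynamic records of one article into a single dict."""
    static_raw = store.get(f"static_news_detail:{news_id}")
    dynamic_raw = store.get(f"dynamic_news_detail:{news_id}")
    if static_raw is None or dynamic_raw is None:
        return None  # one half is missing, so there is nothing complete to show
    detail = json.loads(static_raw)
    detail.update(json.loads(dynamic_raw))
    return detail


def bump_dynamic_field(store, news_id, field, delta=1):
    """Update one counter; only the small dynamic record is rewritten."""
    key = f"dynamic_news_detail:{news_id}"
    dynamic = json.loads(store[key])
    dynamic[field] = dynamic.get(field, 0) + delta
    store[key] = json.dumps(dynamic)
```

The benefit of the split shows up in `bump_dynamic_field`: a like or a read touches only a few counters and never rewrites the article body.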
@@ -306,11 +306,11 @@ if __name__ == "__main__":
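Because our system keeps all registered users (new and old) in a single table, each profile update only needs one pass over the users in the registration table. Before describing the profile construction logic, we need to know which fields a user profile contains. Below is a record queried directly from MongoDB.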
-
+
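As shown above, the profile mainly contains the user's basic information plus tags derived from the user's history. The basic attributes can be read directly from the registration table. For information related to reading history, we need to aggregate the details of all news the user has read, liked, and favorited. To obtain information about the user's historical interests, we need to store the read, like, and favorite histories. All of this could be derived from the logs, but there is an engineering point to explain first. Look at the figure below, which shows the detail page a user sees after clicking into an article: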
-
+
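At the very bottom there are like and favorite buttons. What the front end shows here comes from the back end, which means the back end has to maintain a list of the articles each user has clicked and favorited. We store it in MySQL, mainly out of concern that Redis would not have enough capacity. These two tables are useful beyond front-end display: they can also be used for user profiling, since they already collect each user's historical likes and favorites.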
@@ -622,7 +622,7 @@ echo " "
**crontab scheduled tasks:**
-![image-20211203172613512](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203172613512.png)
+![image-20211203172613512](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203172613512.png)
Breaking the scheduled tasks down:
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.2.3 \345\211\215\345\220\216\347\253\257\344\272\244\344\272\222.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.2.3 \345\211\215\345\220\216\347\253\257\344\272\244\344\272\222.md"
index e251e6515..e736c0fa1 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.2.3 \345\211\215\345\220\216\347\253\257\344\272\244\344\272\222.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.2.3 \345\211\215\345\220\216\347\253\257\344\272\244\344\272\222.md"
@@ -6,7 +6,7 @@
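The following mainly presents the project as a whole, which consists of the recommendation page, the hot page, and the news detail page.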
-
+
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.3.1 \346\216\250\350\215\220\347\263\273\347\273\237\346\265\201\347\250\213\347\232\204\346\236\204\345\273\272.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.3.1 \346\216\250\350\215\220\347\263\273\347\273\237\346\265\201\347\250\213\347\232\204\346\236\204\345\273\272.md"
index 3beac83f1..b2cce14f1 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.3.1 \346\216\250\350\215\220\347\263\273\347\273\237\346\265\201\347\250\213\347\232\204\346\236\204\345\273\272.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.3.1 \346\216\250\350\215\220\347\263\273\347\273\237\346\265\201\347\250\213\347\232\204\346\236\204\345\273\272.md"
@@ -1,6 +1,6 @@
-![](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片Untitled.png)
+![](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片Untitled.png)
This article explains how the recommender system pipeline is built, covering the Offline and Online parts.
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.5.1 DSSM\345\217\254\345\233\236.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.5.1 DSSM\345\217\254\345\233\236.md"
index 166009600..d354407c5 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.5.1 DSSM\345\217\254\345\233\236.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/docs/2.2.5.1 DSSM\345\217\254\345\233\236.md"
@@ -12,7 +12,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
### **DSSM Model Structure**
-![image-20220224100424897](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100424897.png)
+![image-20220224100424897](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100424897.png)
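The figure above shows the structure of the DSSM model. The network is fairly simple: it consists of a few DNN layers. The embeddings of the search text (Query) and the text to match (Document) are fed into the network, which outputs 128-dimensional vectors. The cosine similarity between the vectors measures their distance and can be read as a similarity score between each query and document; a softmax is then applied over these scores.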
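The final scoring step (cosine similarity followed by a softmax over the candidate documents) can be sketched as follows. The vectors are assumed to be the outputs of the DNN layers; the function names and the `gamma` smoothing factor on the softmax are assumptions of this sketch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dssm_match_probs(query_vec, doc_vecs, gamma=1.0):
    """Softmax over the cosine similarities of one query against several documents."""
    sims = [cosine(query_vec, doc) for doc in doc_vecs]
    max_sim = max(sims)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(gamma * (s - max_sim)) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]
```

During training, the probability assigned to the clicked document is what the loss pushes up, which in turn pulls matching query and document vectors closer together.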
@@ -28,7 +28,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
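This model mainly turns the two "towers" of the model above into two independent sub-networks, one for the user and one for the item. The rough structure is as follows: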
-![img](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-f7ecbf1faf7899c6e2999182055470fb_720w.jpg)
+![img](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-f7ecbf1faf7899c6e2999182055470fb_720w.jpg)
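The structure is very simple. As the figure above shows, the user tower is on the left and the item tower on the right. The user tower takes user-side features as input (user profile information, statistical attributes, historical behavior sequences, and so on); the item tower takes item-related features as input (basic item information, attribute information, and so on). Each tower is a classic DNN: during training the input goes from one-hot features to feature embeddings and then through several hidden DNN layers, and the two towers output the user embedding and the item embedding respectively. Finally the inner product or cosine similarity of the two embeddings is computed, so that user and item are mapped into a semantic space of the same dimension.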
@@ -38,7 +38,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
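The main improvement in this model is on the feature embedding layers of the user tower and item tower: each gets an SENet module that dynamically learns feature importance. The learned feature weights are multiplied with the corresponding feature embeddings, which amplifies important features and suppresses ineffective ones. The rough structure is as follows: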
-![img](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-8766fee1b442ed17111d5822033f960f_720w.jpg)
+![img](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片v2-8766fee1b442ed17111d5822033f960f_720w.jpg)
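This model differs from the plain DSSM model by one extra SENet network. The network passes the feature embeddings through a Squeeze stage and an Excitation stage to obtain a weight vector, multiplies each feature's embedding by its corresponding weight to emphasize the important features, and then feeds the result into the plain DSSM network. As for why SENet works, Zhang Junlin's explanation is that SENet highlights the features that matter most for the feature crossing of the high-level user embedding and item embedding, which helps express the interaction between the two sides and avoids the noise that ineffective single-side features introduce during the nonlinear fusion in the two DNN towers, while also adding nonlinearity. For details of the SENet network, see the [original paper](https://arxiv.org/abs/1709.01507)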
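The Squeeze and Excitation stages can be sketched in plain Python as below. This is a minimal illustration under stated assumptions: mean pooling is used for the squeeze, the excitation is two small linear layers (ReLU, then sigmoid) as in the SENet paper, and the weight matrices are passed in by hand instead of being learned. The function and parameter names are hypothetical.

```python
import math


def senet_reweight(field_embs, w_reduce, w_expand):
    """Re-weight each feature embedding with an SENet-style gate.

    field_embs: one embedding (list of floats) per feature field.
    w_reduce:   r rows of length f (f fields squeezed to r hidden units).
    w_expand:   f rows of length r (back to one gate per field).
    """
    # Squeeze: summarize each field's embedding as a single number.
    squeezed = [sum(emb) / len(emb) for emb in field_embs]
    # Excitation: reduce with ReLU, then expand with sigmoid to get gates in (0, 1).
    hidden = [
        max(0.0, sum(w * z for w, z in zip(row, squeezed)))
        for row in w_reduce
    ]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Re-weight: scale every embedding by its field's gate.
    reweighted = [[gate * x for x in emb] for gate, emb in zip(gates, field_embs)]
    return reweighted, gates
```

A gate near 1 leaves a field's embedding almost unchanged, while a gate near 0 suppresses it before it enters the tower.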
@@ -48,7 +48,7 @@ DSSM(Deep Structured Semantic Model)是由微软研究院于CIKM在2013年提出
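This model is work published by YouTube at RecSys in 2019. Structurally it is the most ordinary two-tower model. On the left is the user tower, whose input has two parts: the features of the video the user is currently watching, and user features that are statistics of the user's historical behavior, such as the mean id embedding of the last N videos watched. The two parts are merged and fed into the user side. On the right is the item tower, which takes the candidate video's features as input and computes the item embedding. The similarity between the two embeddings is then computed and used for learning. The rough structure is as follows: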
-![image-20220224100307472](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100307472.png)
+![image-20220224100307472](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20220224100307472.png)
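For this model the focus is not a structural change but the negative sampling problem. Recall can be viewed as a multi-class classification problem, so the output layer applies a softmax and then a cross-entropy loss. When there are very many candidate items, however, a softmax over all of them is infeasible, so the usual practice is to randomly sample a batch of items from the full set and apply the softmax over that batch. But using the in-batch samples as each other's negatives introduces a large bias: popular items are more likely to be used as negatives. The model's contribution is therefore a way to reduce the bias caused by in-batch negative sampling. For details of the paper, see the [original paper](https://dl.acm.org/doi/10.1145/3298689.3346996)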
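One widely described way to reduce this bias, and to my understanding the one the paper uses, is to subtract the log of each item's sampling probability from its logit before the in-batch softmax. The sketch below illustrates that correction in plain Python; the function names are hypothetical, and the sampling probabilities are supplied directly as an assumption, whereas a production system would have to estimate them.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(user_vecs, item_vecs, item_sampling_probs):
    """Mean cross-entropy where row i's positive is item i and the rest are negatives.

    Each logit is corrected by -log(p_j), where p_j is the (assumed known)
    probability that item j appears in a batch.
    """
    batch_size = len(user_vecs)
    total = 0.0
    for i in range(batch_size):
        logits = [
            dot(user_vecs[i], item_vecs[j]) - math.log(item_sampling_probs[j])
            for j in range(batch_size)
        ]
        max_logit = max(logits)
        log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_norm - logits[i]  # negative log-likelihood of the positive
    return total / batch_size
```

With uniform sampling probabilities the correction cancels out and the loss equals the ordinary in-batch softmax loss; it only changes the result when some items are sampled more often than others.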
diff --git "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/readme.md" "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/readme.md"
index 2632563b5..2aa49bd87 100644
--- "a/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/readme.md"
+++ "b/docs/\347\254\254\344\272\214\347\253\240 \346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/2.2\346\226\260\351\227\273\346\216\250\350\215\220\347\263\273\347\273\237\345\256\236\346\210\230/readme.md"
@@ -68,4 +68,4 @@ github上给出了参考资料,其实也是用来作为查询的,因为每
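If, after finishing this round of team study, you can understand the flow chart below, you are doing very well. There is a lot of material and it leans heavily toward practice, so truly understanding the detailed flow takes a fair amount of time reading the source code.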
-![image-20211203193754525](http://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203193754525.png)
+![image-20211203193754525](https://ryluo.oss-cn-chengdu.aliyuncs.com/图片image-20211203193754525.png)
diff --git a/readme.md b/readme.md
index bca478302..ab880d26d 100644
--- a/readme.md
+++ b/readme.md
@@ -19,7 +19,7 @@
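To make learning and discussion easier, **we have set up the FunRec learning community (WeChat group + 知识星球, "Knowledge Planet")**. The WeChat group is for everyday discussion, and Knowledge Planet is for accumulating content. Since our material is mainly aimed at students, **Knowledge Planet is permanently free**; if you are interested, join the discussion there (new members should first read the pinned must-read posts)! **The FunRec community shares technical summaries, personal management tips, and similar content from time to time (from members who enjoy sharing), and [the technical talks are all posted on Bilibili](https://space.bilibili.com/431850986/channel/collectiondetail?sid=339597)**. Because a WeChat group QR code is only valid for 7 days, add the WeChat account below with the note **Fun-Rec** and you will be invited to the Fun-Rec discussion group. If you find the WeChat group too noisy, we suggest joining Knowledge Planet directly!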
-
+
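**Note: reading directly on GitHub is not recommended (formulas and images tend to render incorrectly). We recommend the online reading link above, or downloading the material and viewing it offline with a markdown tool such as Typora!**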
@@ -136,15 +136,15 @@
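[2.1 Competition practice (Tianchi introductory competition: news recommendation)](https://tianchi.aliyun.com/competition/entrance/531842/forum)
**2.2 News recommender system practice: front-end display and back-end logic (the project has no commercial value and is for beginners' learning only)**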