nginx 编程 php 微软 mysql linux 云计算 java Python Firefox 程序员 centos 开源 google Windows Android Ubuntu shell apache wordpress

推薦系統分析 – 推薦算法, RecDB推薦數據庫介紹

背景

中華文化源遠流長,從古至今有很多有趣的學問。比如看風水、看相,在西方文化中有類似的比如星座學說。

這些和推薦系統有什麽關系呢?

個人感覺星座學說這些學問有一定的理論基礎,更多的則是也是經驗的總結。

推薦系統實際上和星座學說類似,有一定的算法基礎,算法則可能是來自經驗的總結。

在現實場景中,可以找到很多類似的例子,這些例子是一些數據,根據推薦算法,可以找到用戶可能感興趣的東西。

1. 比如豆瓣有幾千萬用戶,用戶在豆瓣上可以對影片打分和點評,就形成了一批這樣的數據:

用戶ID,影片ID,打分  
  
1, '終結者', 90  
902, '笑傲江湖', 78  
......  
  
我暫且稱之為打分表, 或者rating數據  
2. 電商應用,也有類似的rating數據,比如這樣的數據結構:

用戶ID,購買的商品ID,購買數量  
  
或者  
  
用戶ID,瀏覽的店鋪ID,瀏覽的次數  
  
或者  
  
用戶ID,瀏覽的商品ID,瀏覽次數  
3. 其他應用, 使用應用程序的feed,可以找到用戶與用戶,用戶與物體的親密度(rating)。

有了用戶與用戶,用戶與物體的親密度(rating)數據之後,就可以根據推薦算法,給用戶推薦其他用戶或物體。

推薦算法相關的文章可以參考

https://www.ibm.com/developerworks/cn/web/1103_zhaoct_recommstudy1/index.html

接下來容重介紹一款繼承了多種推薦算法的數據庫RecDB。

RecDB(推薦數據庫)

RecDB是基於PostgreSQL打造的一款專業的推薦數據庫,目前繼承了以下推薦算法。

Currently, the available recommendation algorithms that could be passed to the USING clause are the following:

ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.

ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.

UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.

UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.

SVD Simon Funk Singular Value Decomposition.

Applications powered by RecDB can produce online and flexible personalized recommendations to end-users.

RecDB的用法非常簡單,就如同使用PostgreSQL一樣,對用戶非常友好,同時擴展了一些推薦相關的SQL語法。

如果你還不會用PostgreSQL,可以來看看這個

《2011年功力的德哥教你2天擼通PostgreSQL - 入門、開發、原理、管理、調優》

RecDB安裝

1. 編譯安裝recdb

git clone https://github.com/Sarwat/recdb-postgresql.git  
  
cd recdb-postgresql/PostgreSQL  
  
./configure --prefix=/home/digoal/recdb  
  
make world -j 32  
  
make install-world  
  
export PS1="$USER@`/bin/hostname -s`-> "  
export PGPORT=9999  
export PGDATA=/home/digoal/pgdata/pg_root9999  
export LANG=en_US.utf8  
export PGHOME=/home/digoal/recdb  
export LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib:$LD_LIBRARY_PATH  
export LD_RUN_PATH=$LD_LIBRARY_PATH  
export date=`date +"%Y%m%d%H%M"`  
export PATH=$PGHOME/bin:$PATH:.  
export MANPATH=$PGHOME/share/man:$MANPATH  
export PGHOST=127.0.0.1  
export PGUSER=postgres  
export PGDATABASE=postgres  
alias rm='rm -i'  
alias ll='ls -lh'  
unalias vi  

2. 初始化数据库集群

initdb -D $PGDATA -E UTF8 --locale=C -U postgres  

3. 配置

cd $PGDATA  
  
vi postgresql.conf  
listen_addresses = '0.0.0.0'  
port = 9999   
max_connections = 100  
unix_socket_directory = '.'  
shared_buffers = 32GB  
maintenance_work_mem = 1GB  
vacuum_cost_delay = 0ms  
bgwriter_delay = 10ms  
bgwriter_lru_maxpages = 1000  
bgwriter_lru_multiplier = 5.0  
wal_level = minimal  
synchronous_commit = off  
wal_buffers = 128MB  
wal_writer_delay = 10ms  
checkpoint_segments = 2048    
checkpoint_timeout = 35min  
checkpoint_completion_target = 0.1  
random_page_cost = 1.2  
effective_cache_size = 128GB  
log_destination = 'csvlog'  
logging_collector = on  
log_directory = 'pg_log'  
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'  
log_truncate_on_rotation = on  
log_checkpoints = on  
log_connections = on  
log_disconnections = on  
log_error_verbosity = verbose  
log_timezone = 'PRC'  
log_autovacuum_min_duration = 0  
autovacuum_max_workers = 8  
autovacuum_naptime = 10s  
autovacuum_vacuum_scale_factor = 0.05  
autovacuum_analyze_scale_factor = 0.1  
autovacuum_freeze_max_age = 1500000000  
autovacuum_vacuum_cost_delay = 0  
vacuum_freeze_table_age = 1300000000  
datestyle = 'iso, mdy'  
timezone = 'PRC'  
lc_messages = 'C'  
lc_monetary = 'C'  
lc_numeric = 'C'  
lc_time = 'C'  
default_text_search_config = 'pg_catalog.english'  
  
  
vi pg_hba.conf  
host all all 0.0.0.0/0 md5  

4. 启动数据库集群

pg_ctl start  

5. 测试数据库是否可用

psql  
psql (9.2.0)  
type "help" for help.  
  
postgres=# \dt  
No relations found.  
postgres=# \l  
                             List of databases  
   Name    |  Owner   | Encoding | Collate | Ctype |   Access privileges     
-----------+----------+----------+---------+-------+-----------------------  
 postgres  | postgres | UTF8     | C       | C     |   
 template0 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
 template1 | postgres | UTF8     | C       | C     | =c/postgres          +  
           |          |          |         |       | postgres=CTc/postgres  
(3 rows)  

RecDB 使用例子

1. 创建测试表,三个字段,分别为用户ID,电影ID,评分。

postgres=# create table douban_rating(uid int, movie_id int, access_rating real);  
CREATE TABLE  

2. 插入100万评分数据,1000万用户ID,点评1000部电影

postgres=# insert into douban_rating select random()*10000000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000  

3. 创建RECOMMENDER,需要指出推荐算法,字段。

推荐算法如下

  • ItemCosCF Item-Item Collaborative Filtering using Cosine Similarity measure.

  • ItemPearCF Item-Item Collaborative Filtering using Pearson Correlation Similarity measure.

  • UserCosCF User-User Collaborative Filtering using Cosine Similarity measure.

  • UserPearCF User-User Collaborative Filtering using Cosine Similarity measure.

  • SVD Simon Funk Singular Value Decomposition.

postgres=# set maintenance_work_mem ='32GB';  
SET  
postgres=# set work_mem ='32GB';  
SET  
postgres=# \timing  
Timing is on.  
  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;  
  
NOTICE:  CREATE TABLE will create implicit sequence "movierecindex_systemid_seq" for serial column "movierecindex.systemid"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecindex_pkey" for table "movierecindex"  
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "movierecview1491804367109912_pkey" for table "movierecview1491804367109912"  
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "movierecmodel1491804367109912_pkey" for table "movierecmodel1491804367109912"  
CREATE RECOMMENDER  
postgres=# \dt+  
                                  List of relations  
 Schema |             Name              | Type  |  Owner   |    Size    | Description   
--------+-------------------------------+-------+----------+------------+-------------  
 public | douban_rating                 | table | postgres | 42 MB      |   
 public | movierecindex                 | table | postgres | 16 kB      |   
 public | movierecmodel1491804035307409 | table | postgres | 2136 kB    |   
 public | movierecview1491804035307409  | table | postgres | 8192 bytes |   
 public | recdbproperties               | table | postgres | 8192 bytes |   
 public | recmodelscatalogue            | table | postgres | 16 kB      |   
(6 rows)  

查看数据

postgres=# select * from douban_rating limit 10;  
   uid   | movie_id | access_rating   
---------+----------+---------------  
  359604 |     4798 |      0.796408  
 7749197 |     1764 |      0.194022  
 9288272 |      947 |      0.696304  
 6032232 |     1738 |      0.745247  
 6451861 |     6278 |      0.416638  
 3290076 |      510 |      0.571497  
  334635 |     4904 |      0.552451  
 2313039 |     3906 |      0.963749  
 2535368 |     6377 |      0.297736  
 2939719 |     7603 |      0.624071  
(10 rows)  

查看某个用户的点评数据

postgres=# select * from douban_rating where uid=359604;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     4798 |      0.796408  
(1 row)  

查看具有共性的用户,点评了哪些电影

postgres=# select distinct movie_id from douban_rating where uid in (select uid from douban_rating where movie_id=4798);  
 movie_id   
----------  
     2667  
     4798  
     1686  
     9008  
     3994  
     1156  
     1679  
     4103  
     1090  
(9 rows)  

4. 使用推荐算法,给uid=359604的用户,推荐其他电影,返回的是具有共性的电影,分值从高到低排序。

postgres=# SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
  uid   | movie_id | access_rating   
--------+----------+---------------  
 359604 |     2667 |      0.796408  
 359604 |     9008 |      0.796408  
 359604 |     1090 |      0.796408  
 359604 |     1679 |      0.796408  
 359604 |     1686 |      0.796408  
 359604 |     4103 |      0.796408  
 359604 |     1156 |      0.796408  
 359604 |     3994 |      0.796408  
 359604 |        6 |             0  
 359604 |        5 |             0  
(10 rows)  

执行计划如下

postgres=# explain (analyze,verbose,timing,costs,buffers) SELECT * FROM douban_rating R  
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 359604  
ORDER BY R.access_rating desc  
LIMIT 10;  
                                                                    QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.091..3337.093 rows=10 loops=1)  
   Output: uid, movie_id, access_rating  
   Buffers: shared hit=61840, temp read=9993 written=9993  
   ->  sort  (cost=17906.01..17906.01 rows=1 width=12) (actual time=3337.090..3337.092 rows=10 loops=1)  
         Output: uid, movie_id, access_rating  
         Sort Key: r.access_rating  
         Sort Method: top-N heapsort  Memory: 25kB  
         Buffers: shared hit=61840, temp read=9993 written=9993  
         ->  Result  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.739..3332.023 rows=10001 loops=1)  
               Output: uid, movie_id, access_rating  
               Buffers: shared hit=61840, temp read=9993 written=9993  
               ->  Recommend on public.douban_rating r  (cost=0.00..17906.00 rows=1 width=12) (actual time=2948.738..3329.177 rows=10001 loops=1)  
                     Output: uid, movie_id, access_rating  
                     Filter: (r.uid = 359604)  
                     Buffers: shared hit=61840, temp read=9993 written=9993  
 Total runtime: 3337.116 ms  
(16 rows)  

5. 由于共性用户较少,重新造一部分数据,让共性更多,1万用户点评1000部电影。

postgres=# drop RECOMMENDER MovieRec;  
DROP RECOMMENDER  
postgres=# truncate douban_rating ;  
TRUNCATE TABLE  
postgres=# insert into douban_rating select random()*10000, random()*1000, random() from generate_series(1,1000000);  
INSERT 0 1000000  
postgres=# CREATE RECOMMENDER MovieRec ON douban_rating  
USERS FROM uid  
ITEMS FROM movie_id  
EVENTS FROM access_rating  
USING ItemCosCF;  

查询原始数据如下

postgres=# select * from douban_rating limit 10;  
 uid  | movie_id | access_rating   
------+----------+---------------  
  128 |      848 |      0.796747  
 4953 |      401 |      0.832318  
 2766 |      874 |       0.61931  
 5572 |      430 |      0.550044  
    6 |      709 |      0.798314  
 1896 |      237 |      0.559974  
 4917 |      614 |      0.517259  
 6697 |      886 |      0.804338  
 2232 |      534 |      0.873135  
 4574 |      557 |       0.38828  
(10 rows)  
  
  
postgres=# select * from douban_rating where uid=128;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      848 |      0.796747  
 128 |      755 |      0.139934  
 128 |       79 |      0.633511  
 128 |      979 |      0.145586  
 128 |      120 |      0.153884  
 128 |      839 |     0.0865545  
 ......  

7. 给用户128推荐电影,现在的结果有了一定的可参考性。

postgres=# SELECT * FROM MovieRec R                                                 
RECOMMEND R.movie_id TO R.uid ON R.access_rating USING ItemCosCF  
WHERE R.uid = 128  
ORDER BY R.access_rating desc  
LIMIT 10;  
 uid | movie_id | access_rating   
-----+----------+---------------  
 128 |      422 |      0.514567  
 128 |      127 |      0.514059  
 128 |      495 |      0.513637  
 128 |      974 |      0.513447  
 128 |      487 |      0.512524  
 128 |       64 |      0.512519  
 128 |      868 |      0.512367  
 128 |      132 |      0.512323  
 128 |        0 |       0.51225  
 128 |       54 |      0.512082  
(10 rows)  

8. recDB支持复杂的查询,例如join.

SELECT * FROM ml_ratings R, Movies M  
RECOMMEND R.itemid TO R.userid ON R.ratingval USING ItemCosCF  
WHERE R.userid = 1 AND M.movieid = R.itemid AND M.genre LIKE '%Comedy%'  
ORDER BY R.ratingval  
LIMIT 10  

更多例子

https://github.com/DataSystemsLab/recdb-postgresql/tree/master/examples

小結

1. 未來RecDB可能會整合到PG內核,也可能作為一款PG的插件(可能性較大),Pipelinedb(流式計算數據庫)也在做插件化的改造。

目前基於PG改造或擴展的產品非常的多,包括

1. 流計算數據庫產品 pipelineDB

2. 推薦數據庫產品 recDB

3. 時序數據庫 timescaleDB

4. 分布式數據庫插件 citus

5. 列存儲插件 IMCS, cstore等

6. 面向OLAP的codegen數據庫 pg_LLVM

7. 向量計算插件 vops

。。。

不同的場景,可以找到適合對應場景的插件。

2. recDB實現的推薦查詢,與圖數據也有一定的類似性,PostgreSQL在圖式搜索方面的應用可以參考如下

延伸阅读

    评论