modified codes for python3 use and update the dataset link #17

Open
wants to merge 4 commits into
base: master
4 changes: 2 additions & 2 deletions Makefile
@@ -13,11 +13,11 @@ init: $(ORIGINALFOLDER)
cp $(ORIGINALFOLDER)/training2nd/clk.*.bz2 $(TRAIN)
cp $(ORIGINALFOLDER)/training3rd/imp.*.bz2 $(TRAIN)
cp $(ORIGINALFOLDER)/training3rd/clk.*.bz2 $(TRAIN)
- bzip2 -d $(TRAIN)/*
+ pbzip2 -d $(TRAIN)/*
mkdir -p $(TEST)
cp $(ORIGINALFOLDER)/testing2nd/* $(TEST)
cp $(ORIGINALFOLDER)/testing3rd/* $(TEST)
- bzip2 -d $(TEST)/*
+ pbzip2 -d $(TEST)/*
mkdir $(BASE)/all

clk: $(TRAIN)
22 changes: 16 additions & 6 deletions README.md
@@ -3,19 +3,28 @@ make-ipinyou-data

This project formalises the iPinYou RTB data into a standard format for further research.

+ **Run this code on Linux or WSL to prevent unexpected errors.**

### Step 0
- The raw data of iPinYou (`ipinyou.contest.dataset.zip`) can be downloaded from [UCL website](http://bunwell.cs.ucl.ac.uk/ipinyou.contest.dataset.zip).
+ The raw data of iPinYou (`ipinyou.contest.dataset.zip`) can be downloaded from [Kaggle](https://www.kaggle.com/datasets/lastsummer/ipinyou).

Unzip it and get the folder `ipinyou.contest.dataset`.

+ To speed up bzip2 decompression, install `pbzip2`.
+ ```
+ # for example on Ubuntu
+ sudo apt-get update
+ sudo apt-get install pbzip2
+ ```

### Step 1
Update the soft link for the folder `ipinyou.contest.dataset` in `original-data`.
```
- weinan@ZHANG:~/Project/make-ipinyou-data/original-data$ ln -sfn ~/Data/ipinyou.contest.dataset ipinyou.contest.dataset
+ make-ipinyou-data/original-data$ ln -sfn ~/Data/ipinyou.contest.dataset ipinyou.contest.dataset
```
Under `make-ipinyou-data/original-data/ipinyou.contest.dataset` there should be the original dataset files like this:
```
- weinan@ZHANG:~/Project/make-ipinyou-data/original-data/ipinyou.contest.dataset$ ls
+ make-ipinyou-data/original-data/ipinyou.contest.dataset$ ls
algo.submission.demo.tar.bz2 README testing2nd training3rd
city.cn.txt region.cn.txt testing3rd user.profile.tags.cn.txt
city.en.txt region.en.txt training1st user.profile.tags.en.txt
@@ -28,7 +37,7 @@ Under `make-ipinyou-data` folder, just run `make all`.

After the program finishes, the total size of the folder will be 14G. The files under `make-ipinyou-data` should look like this:
```
- weinan@ZHANG:~/Project/make-ipinyou-data$ ls
+ make-ipinyou-data$ ls
1458 2261 2997 3386 3476 LICENSE mkyzxdata.sh python schema.txt
2259 2821 3358 3427 all Makefile original-data README.md
```
@@ -37,12 +46,13 @@ Normally, we only do experiment for each campaign (e.g. `1458`). `all` is just t
### Use of the data
We use campaign 1458 as an example here.
```
- weinan@ZHANG:~/Project/make-ipinyou-data/1458$ ls
+ make-ipinyou-data/1458$ ls
featindex.txt test.log.txt test.yzx.txt train.log.txt train.yzx.txt
```
* `train.log.txt` and `test.log.txt` are the formalised string data for each row (record) in train and test. The first column indicates whether the user clicked the ad. The 14th column is the winning price for this auction.
* `featindex.txt` maps the features to their indexes. For example, `8:115.45.195.* 29` means that the 8th column in `train.log.txt` with the string `115.45.195.*` maps to feature index `29`.
* `train.yzx.txt` and `test.yzx.txt` are the mapped vector data for `train.log.txt` and `test.log.txt`. The format is y:click, z:winning_price, and x:features. Such data is in the standard form introduced in [iPinYou Benchmarking](http://arxiv.org/abs/1407.7073).
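
As a rough illustration of the file formats described in the list above, the following Python sketch parses one line of `featindex.txt` and one line of a `.yzx.txt` file. The function names are hypothetical. The `featindex.txt` layout follows the `feat + '\t' + str(index)` write in `python/mkyzx.py`; the `.yzx.txt` layout (whitespace-separated `y z index:value ...`) is an assumption that this README does not spell out, so check it against your generated files first.

```python
def parse_featindex_line(line):
    """Split one featindex.txt line into (feature string, feature index).

    Splitting on the last run of whitespace keeps any spaces inside
    the feature string itself.
    """
    feat, index = line.rstrip("\n").rsplit(None, 1)
    return feat, int(index)


def parse_yzx_line(line):
    """Split one .yzx.txt line into (click, winning price, feature indexes).

    Assumed layout (not confirmed by the README): "y z idx:val idx:val ...".
    """
    fields = line.split()
    y = int(fields[0])  # click label
    z = int(fields[1])  # winning price
    x = [int(token.split(":", 1)[0]) for token in fields[2:]]
    return y, z, x


feat, index = parse_featindex_line("8:115.45.195.*\t29\n")
click, price, features = parse_yzx_line("0 55 0:1 29:1\n")  # made-up record
```

Here the first call uses the README's own `8:115.45.195.*` example; the `.yzx.txt` record is invented purely to exercise the parser.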


- For any questions, please report the issues or contact [Weinan Zhang](http://www0.cs.ucl.ac.uk/staff/w.zhang/).
+ For any questions, please report the issues or contact [Weinan Zhang](http://www0.cs.ucl.ac.uk/staff/w.zhang/) or [frinkleko](https://github.com/frinkleko)

2 changes: 1 addition & 1 deletion mkyzxdata.sh
@@ -2,6 +2,6 @@ advertisers="1458 2261 2997 3386 3476 2259 2821 3358 3427"

for advertiser in $advertisers; do
echo $advertiser
- python python/mkyzx.py $advertiser/train.log.txt $advertiser/test.log.txt $advertiser/train.yzx.txt $advertiser/test.yzx.txt $advertiser/featindex.txt
+ python3 python/mkyzx.py $advertiser/train.log.txt $advertiser/test.log.txt $advertiser/train.yzx.txt $advertiser/test.yzx.txt $advertiser/featindex.txt
done

4 changes: 2 additions & 2 deletions python/formalizeua.py
@@ -1,9 +1,9 @@
- #!/usr/bin/python
+ #!/usr/bin/python3
  import sys
  import os

  if len(sys.argv) < 2:
-     print 'Usage: input'
+     print('Usage: input')
      exit(-1)


4 changes: 2 additions & 2 deletions python/mkdata.py
@@ -1,9 +1,9 @@
- #!/usr/bin/python
+ #!/usr/bin/python3
  import sys
  from datetime import date

  if len(sys.argv) < 3:
-     print 'Usage: schema clickfiles'
+     print('Usage: schema clickfiles')
      exit(-1)

schema = [ s.strip() for s in open(sys.argv[1]).read().split() ]
4 changes: 2 additions & 2 deletions python/mktest.py
@@ -1,9 +1,9 @@
- #!/usr/bin/python
+ #!/usr/bin/python3
  import sys
  from datetime import date

  if len(sys.argv) < 2:
-     print 'Usage: schema '
+     print('Usage: schema ')
      exit(-1)

schema = [ s.strip() for s in open(sys.argv[1]).read().split() ]
16 changes: 8 additions & 8 deletions python/mkyzx.py
@@ -1,9 +1,9 @@
- #!/usr/bin/python
+ #!/usr/bin/python3
  import sys
  import operator

  if len(sys.argv) < 5:
-     print 'Usage: train.log.txt test.log.txt train.lr.txt test.lr.txt featindex.txt'
+     print('Usage: train.log.txt test.log.txt train.lr.txt test.lr.txt featindex.txt')
      exit(-1)

oses = ["windows", "ios", "mac", "android", "linux"]
@@ -89,15 +89,15 @@ def getTags(content):
featindex[feat] = maxindex
maxindex += 1

- print 'feature size: ' + str(maxindex)
- featvalue = sorted(featindex.iteritems(), key=operator.itemgetter(1))
+ print('feature size: ' + str(maxindex))
+ featvalue = sorted(featindex.items(), key=operator.itemgetter(1))
fo = open(sys.argv[5], 'w')
for fv in featvalue:
fo.write(fv[0] + '\t' + str(fv[1]) + '\n')
fo.close()

# indexing train
- print 'indexing ' + sys.argv[1]
+ print('indexing ' + sys.argv[1])
fi = open(sys.argv[1], 'r')
fo = open(sys.argv[3], 'w')

@@ -138,7 +138,7 @@ def getTags(content):
fo.close()

# indexing test
- print 'indexing ' + sys.argv[2]
+ print('indexing ' + sys.argv[2])
fi = open(sys.argv[2], 'r')
fo = open(sys.argv[4], 'w')

@@ -154,8 +154,8 @@ def getTags(content):
for f in f1s: # every direct first order feature
col = namecol[f]
if col >= len(s):
- print 'col: ' + str(col)
- print line
+ print('col: ' + str(col))
+ print(line)
content = s[col]
feat = str(col) + ':' + content
if feat not in featindex:
4 changes: 2 additions & 2 deletions python/splitadvertisers.py
@@ -1,9 +1,9 @@
- #!/usr/bin/python
+ #!/usr/bin/python3
  import sys
  import os

  if len(sys.argv) < 5:
-     print 'Usage: ipinyou.folder 25 train.log.txt test.log.txt'
+     print('Usage: ipinyou.folder 25 train.log.txt test.log.txt')
      # python splitadvertisers.py ../ 25 ../all/train.log.txt ../all/test.log.txt
      exit(-1)
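
The `python/mkyzx.py` hunk in this diff swaps `dict.iteritems()` for `dict.items()`, since `iteritems()` does not exist in Python 3. A minimal sketch of that sort-by-index step, assuming a made-up `featindex` dictionary (only the `8:115.45.195.*` entry comes from the README; the other keys are hypothetical):

```python
import operator

# Hypothetical feature-to-index mapping; real ones are built by mkyzx.py.
featindex = {"8:115.45.195.*": 29, "example:a": 0, "example:b": 1}

# In Python 3, items() returns a view of (key, value) pairs;
# itemgetter(1) sorts those pairs by the index value.
featvalue = sorted(featindex.items(), key=operator.itemgetter(1))

# Same "feature<TAB>index" layout that mkyzx.py writes to featindex.txt.
lines = [feat + "\t" + str(index) for feat, index in featvalue]
```

Because `sorted()` accepts any iterable, the view returned by `items()` needs no conversion to a list first.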
