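Kaggle Usage Notes

I took part in the acoustic scene classification subtask of the DCASE2018 challenge. One of its leaderboards is hosted on Kaggle, so I ended up using the Kaggle API during the competition. These are my notes on using Kaggle.

What is Kaggle?

Kaggle is a platform for data science competitions. Many companies publish problems close to their real business needs there, attracting data science enthusiasts to solve them together.

Click competitions in the navigation bar to see the many competitions available. Official competitions usually offer prize money or job opportunities. Besides these, there are playground competitions for beginners, where you can get familiar with how a competition works and practise your skills before entering an official one.

How to enter

Before entering, you need a Kaggle account. Once it is activated, find a competition that interests you and choose "join competitions".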

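Interface overview:

  • Overview: start by reading the problem description carefully. The competition used as the example here asks us to predict house prices. It provides 79 variables that affect the price, and we can apply algorithms such as random forest or gradient boosting to make the prediction.

  • Data: provides the train dataset, used to train the model; the test dataset, on which the trained model makes the predictions that are submitted to the system for scoring; and sample_submission, which shows the column format that the final submitted csv file must follow.

  • Kernels: code shared by other participants.

  • Discussion: participants can ask questions and share experience here.

  • Leaderboard: the ranking of participants.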

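Competition workflow

Step 1: download the datasets from Data. The three files mentioned above are the minimum; some competitions also include extra files such as data descriptions.

Step 2: analyse the data, build models, and tune parameters offline, then write the predictions for the test dataset to a csv file in the sample_submission format.

Step 3: click the blue 'Submit Predictions' button and drag the csv file in. The system loads and checks the result, and after a short wait the Leaderboard shows where the result currently ranks.
Once you have uploaded a result, you have joined the competition.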
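Step 2 above can be sketched in Python roughly as follows. This is only a minimal illustration using the standard csv module: the function names and the Id/SalePrice columns in the usage note are hypothetical, and the real column names must be taken from the competition's own sample_submission file.

```python
import csv


def read_header(sample_path):
    """Return the column names from the first row of the sample submission file."""
    with open(sample_path, newline="") as f:
        return next(csv.reader(f))


def write_submission(sample_path, out_path, rows):
    """Write `rows` (a list of dicts) to `out_path` using the sample file's columns."""
    header = read_header(sample_path)
    for row in rows:
        # Refuse rows whose keys differ from the expected columns.
        if set(row) != set(header):
            raise ValueError(f"row keys {sorted(row)} do not match header {header}")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)
    return header
```

For example, write_submission("sample_submission.csv", "submission.csv", [{"Id": 1, "SalePrice": 123.5}]) would write a two-column file, assuming the sample file's header is Id,SalePrice.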

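Note: in official competitions each team can upload 5 times per day and must then wait 24 hours before uploading again; for playground competitions the limit is 9.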

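Installing and using the Kaggle API

Installation

First make sure Python and the package manager pip are installed.

Run the following commands to get command-line access to the Kaggle API: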

# Windows: the default install directory is $PYTHON_HOME/Scripts
pip install kaggle
# Mac / Linux
pip install --user kaggle
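After installing, it can be useful to confirm that the kaggle executable is actually reachable from the command line. A minimal sketch using the standard library follows; the helper names are hypothetical, and the hint about ~/.local/bin assumes a typical pip --user install on Linux.

```python
import shutil


def find_cli(name="kaggle"):
    """Return the full path of the executable `name`, or None if it is not on PATH."""
    return shutil.which(name)


def describe_cli(name="kaggle"):
    """Return a one-line status message about the executable `name`."""
    path = find_cli(name)
    if path is None:
        # Assumption: pip usually places scripts in Scripts (Windows) or ~/.local/bin (Linux).
        return f"{name}: not found on PATH; check the Scripts or ~/.local/bin directory"
    return f"{name}: found at {path}"
```

Calling print(describe_cli()) reports either the location of the tool or a reminder to add the install directory to PATH.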

Downloading API credentials

  • To use the Kaggle API, register a Kaggle account on the Kaggle website.

  • Go to the 'Account' tab of your user profile and select "create API Token". This triggers the download of kaggle.json, a file containing your API credentials.

  • Place this file at ~/.kaggle/kaggle.json (on Windows, C:\Users\<Windows-username>\.kaggle\kaggle.json).

After my first installation there was no .kaggle folder under C:\Users\<Windows-username>\. After running pip uninstall kaggle and installing again, the .kaggle folder appeared automatically, and I copied kaggle.json straight into it.
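If the .kaggle folder is missing, creating it by hand should work just as well as reinstalling. The sketch below shows one way to do that in Python; install_token is a hypothetical helper, and restricting the file permissions on POSIX systems is my own precaution rather than something stated in these notes.

```python
import os
import shutil
from pathlib import Path


def install_token(token_path, config_dir=None):
    """Copy a downloaded kaggle.json into the config directory, creating it if needed."""
    target_dir = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # create .kaggle when it is missing
    target = target_dir / "kaggle.json"
    shutil.copyfile(token_path, target)
    if os.name == "posix":
        # Assumption: the token is a secret, so make it readable by its owner only.
        os.chmod(target, 0o600)
    return target
```

For example, install_token("Downloads/kaggle.json") would copy the token into the default location under the home directory.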

You can define a shell environment variable KAGGLE_CONFIG_DIR to change this location to $KAGGLE_CONFIG_DIR/kaggle.json (on Windows, %KAGGLE_CONFIG_DIR%\kaggle.json).
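The lookup rule just described can be illustrated with a short Python sketch. It mirrors the rule as stated in these notes and is not the tool's actual implementation; the function names are hypothetical.

```python
import json
import os
from pathlib import Path


def credentials_path(environ=None):
    """Return where kaggle.json is expected, honouring KAGGLE_CONFIG_DIR if set."""
    environ = os.environ if environ is None else environ
    config_dir = environ.get("KAGGLE_CONFIG_DIR")
    # Fall back to the default ~/.kaggle directory when the variable is unset or empty.
    base = Path(config_dir) if config_dir else Path.home() / ".kaggle"
    return base / "kaggle.json"


def load_credentials(environ=None):
    """Read kaggle.json from the resolved location and return its contents as a dict."""
    with open(credentials_path(environ)) as f:
        return json.load(f)
```

Passing a plain dict such as {"KAGGLE_CONFIG_DIR": "/data/kaggle"} instead of the real environment makes the rule easy to check without touching any shell settings.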

Commands

The command line supports the following commands:

kaggle competitions {list,files,download,submit,submissions,leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle config {view, set, unset}
Competitions: the API supports the following commands for Kaggle Competitions.

  • List competitions

usage: kaggle competitions list [-h] [-p PAGE] [-s SEARCH] [-v]
optional arguments:
-h, --help show this help message and exit
-p PAGE, --page PAGE page number
-s SEARCH, --search SEARCH
term(s) to search for
-v, --csv print in CSV format
(if not set print in table format)

Example:

kaggle competitions list -s health
  • List competition files
usage: kaggle competitions files [-h] [-c COMPETITION] [-v] [-q]
optional arguments:
-h, --help show this help message and exit
-c COMPETITION, --competition COMPETITION
Competition URL suffix (use "kaggle competitions list" to show options)
If empty, the default competition will be used (use "kaggle config set competition")"
-v, --csv Print results in CSV format (if not set print in table format)
-q, --quiet Suppress printing information about download progress

Example:

kaggle competitions files -c favorita-grocery-sales-forecasting
  • Download competition files
usage: kaggle competitions download [-h] [-c COMPETITION] [-f FILE] [-p PATH]
[-w] [-o] [-q]
optional arguments:
-h, --help show this help message and exit
-c COMPETITION, --competition COMPETITION
Competition URL suffix (use "kaggle competitions list" to show options)
If empty, the default competition will be used (use "kaggle config set competition")"
-f FILE, --file FILE File name, all files downloaded if not provided
(use "kaggle competitions files -c <competition>" to show options)
-p PATH, --path PATH Folder where file(s) will be downloaded, defaults to ~/.kaggle
-w, --wp Download files to current working path
-o, --force Skip check whether local version of file is up to date, force file download
-q, --quiet Suppress printing information about download progress

Examples:

kaggle competitions download -c favorita-grocery-sales-forecasting
kaggle competitions download -c favorita-grocery-sales-forecasting -f test.csv.7z
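When downloads are part of a larger script, the same command can be assembled and launched from Python. The following is a minimal sketch that only uses the -c, -f, -p and -o flags listed in the usage text above; download_command and run are hypothetical helper names, and run assumes the kaggle executable is on PATH.

```python
import subprocess


def download_command(competition, file_name=None, path=None, force=False):
    """Build the argument list for `kaggle competitions download`."""
    cmd = ["kaggle", "competitions", "download", "-c", competition]
    if file_name:
        cmd += ["-f", file_name]  # a single file instead of all files
    if path:
        cmd += ["-p", path]  # target folder for the download
    if force:
        cmd.append("-o")  # skip the up-to-date check
    return cmd


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(cmd, check=True)
```

For example, run(download_command("favorita-grocery-sales-forecasting", file_name="test.csv.7z")) corresponds to the second command above. Passing the arguments as a list avoids shell quoting problems with file names.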
  • Submit to a competition
usage: kaggle competitions submit [-h] [-c COMPETITION] -f FILE -m MESSAGE
[-q]
required arguments:
-f FILE, --file FILE File for upload (full path)
-m MESSAGE, --message MESSAGE
Message describing this submission
optional arguments:
-h, --help show this help message and exit
-c COMPETITION, --competition COMPETITION
Competition URL suffix (use "kaggle competitions list" to show options)
If empty, the default competition will be used (use "kaggle config set competition")"
-q, --quiet Suppress printing information about download progress

Example:

kaggle competitions submit -c favorita-grocery-sales-forecasting -f sample_submission_favorita.csv.7z -m "My submission message"
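Because -f and -m are both required and -f expects a full path, a small pre-flight check can catch mistakes before a daily submission is used up. Here is a rough sketch; submit_command is a hypothetical helper, and the resulting list would still have to be passed to something like subprocess.run to actually submit.

```python
import os


def submit_command(competition, file_path, message):
    """Validate the inputs and build the argument list for `kaggle competitions submit`."""
    if not message.strip():
        raise ValueError("a non-empty submission message is required (-m)")
    if not os.path.isfile(file_path):
        raise FileNotFoundError(file_path)
    # The usage text asks for the full path of the file to upload.
    full_path = os.path.abspath(file_path)
    return ["kaggle", "competitions", "submit",
            "-c", competition, "-f", full_path, "-m", message]
```

The checks are deliberately simple: they only confirm that the file exists and that a message was given, not that the file contents are valid for the competition.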
  • List competition submissions
usage: kaggle competitions submissions [-h] [-c COMPETITION] [-v] [-q]
optional arguments:
-h, --help show this help message and exit
-c COMPETITION, --competition COMPETITION
Competition URL suffix (use "kaggle competitions list" to show options)
If empty, the default competition will be used (use "kaggle config set competition")"
-v, --csv Print results in CSV format (if not set print in table format)
-q, --quiet Suppress printing information about download progress

Example:

kaggle competitions submissions -c favorita-grocery-sales-forecasting
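With the -v flag the list is printed as CSV, which makes it easy to process in a script. The sketch below parses such output with the standard csv module and picks the best-scoring row. The column names in the usage note (fileName, publicScore) are hypothetical placeholders, so check the header that the command really prints.

```python
import csv
import io


def parse_csv_output(text):
    """Parse CSV text printed by a `-v` command into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def to_float(value):
    """Convert a cell to float, returning None for blank or non-numeric cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def best_row(rows, score_column, higher_is_better=True):
    """Return the row with the best numeric value in `score_column`, or None."""
    scored = [(to_float(r.get(score_column)), r) for r in rows]
    scored = [(s, r) for s, r in scored if s is not None]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda pair: pair[0])[1]
```

For instance, best_row(parse_csv_output(output), "publicScore") would return the highest-scoring submission, assuming the output has a publicScore column. Whether higher or lower is better depends on the competition's metric.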
  • Get competition leaderboard
usage: kaggle competitions leaderboard [-h] [-c COMPETITION] [-s] [-d]
[-p PATH] [-q]
optional arguments:
-h, --help show this help message and exit
-c COMPETITION, --competition COMPETITION
Competition URL suffix (use "kaggle competitions list" to show options)
If empty, the default competition will be used (use "kaggle config set competition")"
-s, --show Show the top of the leaderboard
-d, --download Download entire leaderboard
-p PATH, --path PATH Folder where file(s) will be downloaded, defaults to ~/.kaggle
-q, --quiet Suppress printing information about download progress

Example:

kaggle competitions leaderboard -c favorita-grocery-sales-forecasting -s

Datasets: the API supports the following commands for Kaggle Datasets.

  • List datasets
usage: kaggle datasets list [-h] [-p PAGE] [-s SEARCH] [-v]
optional arguments:
-h, --help show this help message and exit
-p PAGE, --page PAGE Page number for results paging
-s SEARCH, --search SEARCH
Term(s) to search for
-v, --csv Print results in CSV format (if not set print in table format)

Example:

kaggle datasets list -s demographics
  • List files for a dataset
usage: kaggle datasets files [-h] -d DATASET [-v]
required arguments:
-d DATASET, --dataset DATASET
Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
optional arguments:
-h, --help show this help message and exit
-v, --csv Print results in CSV format (if not set print in table format)

Example:

kaggle datasets files -d zillow/zecon
  • Download dataset files
usage: kaggle datasets download [-h] -d DATASET [-f FILE] [-p PATH] [-w] [-o]
[-q]
required arguments:
-d DATASET, --dataset DATASET
Dataset URL suffix in format <owner>/<dataset-name> (use "kaggle datasets list" to show options)
optional arguments:
-h, --help show this help message and exit
-f FILE, --file FILE File name, all files downloaded if not provided
(use "kaggle datasets files -d <dataset>" to show options)
-p PATH, --path PATH Folder where file(s) will be downloaded, defaults to ~/.kaggle
-w, --wp Download files to current working path
-o, --force Skip check whether local version of file is up to date, force file download
-q, --quiet Suppress printing information about download progress

Examples:

kaggle datasets download -d zillow/zecon
kaggle datasets download -d zillow/zecon -f State_time_series.csv
  • Initialize metadata file for dataset creation
usage: kaggle datasets init [-h] -p FOLDER
required arguments:
-p FOLDER, --path FOLDER
Folder for upload, containing data files and a special metadata.json file (https://github.com/Kaggle/kaggle-api/wiki/Metadata)
optional arguments:
-h, --help show this help message and exit

Example:

kaggle datasets init -p /path/to/dataset
  • Create a new dataset
usage: kaggle datasets create [-h] -p FOLDER [-u] [-q]
required arguments:
-p FOLDER, --path FOLDER
Folder for upload, containing data files and a special metadata.json file (https://github.com/Kaggle/kaggle-api/wiki/Metadata)
optional arguments:
-h, --help show this help message and exit
-u, --public Create the Dataset publicly (default is private)
-q, --quiet Suppress printing information about download progress
-t, --keep-tabular Do not convert tabular files to CSV (default is to convert)

Example:

kaggle datasets create -p /path/to/dataset
  • Create a new dataset version
usage: kaggle datasets version [-h] -m VERSION_NOTES -p FOLDER [-q]
required arguments:
-m VERSION_NOTES, --message VERSION_NOTES
Message describing the new version
-p FOLDER, --path FOLDER
Folder for upload, containing data files and a special metadata.json file (https://github.com/Kaggle/kaggle-api/wiki/Metadata)
optional arguments:
-h, --help show this help message and exit
-q, --quiet Suppress printing information about download progress
-t, --keep-tabular Do not convert tabular files to CSV (default is to convert)
-d, --delete-old-versions
Delete old versions of this dataset

Example:

kaggle datasets version -p /path/to/dataset -m "Updated data"

Configuration

  • Set the download path
usage: kaggle config path [-h] [-p PATH]
optional arguments:
-h, --help show this help message and exit
-p PATH, --path PATH folder where file(s) will be downloaded, defaults to ~/.kaggle

Example:

kaggle config path -p C:\

  • View current config values
usage: kaggle config view [-h]
optional arguments:
-h, --help show this help message and exit

Example:

kaggle config view
  • Set a configuration value
usage: kaggle config set [-h] -n NAME -v VALUE
required arguments:
-n NAME, --name NAME Name of the configuration parameter
(one of competition, path, proxy)
-v VALUE, --value VALUE
Value of the configuration parameter, valid values depending on name
- competition: Competition URL suffix (use "kaggle competitions list" to show options)
- path: Folder where file(s) will be downloaded, defaults to ~/.kaggle
- proxy: Proxy for HTTP requests

Example:

kaggle config set -n competition -v titanic
  • Clear a configuration value
usage: kaggle config unset [-h] -n NAME
required arguments:
-n NAME, --name NAME Name of the configuration parameter
(one of competition, path, proxy)

Example:

kaggle config unset -n competition

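Note: the biggest limitation at present is that kernels are not supported in any way. Support is planned for the near future, although there is no ETA. In addition, large datasets (>= 2GB) cannot currently be used.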
