Analyzing PyPI package downloads

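This section covers how to use the public PyPI download statistics dataset to learn more about downloads of a package (or packages) hosted on PyPI. For example, you can use it to discover the distribution of Python versions used to download a package.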

Background

PyPI does not display download statistics, for a number of reasons: [1]

  • Inefficient to make work with a Content Distribution Network (CDN): Download statistics change constantly. Including them in project pages, which are heavily cached, would require invalidating the cache more often, and reduce the overall effectiveness of the cache.

  • Highly inaccurate: A number of things prevent the download counts from being accurate, some of which include:

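    • pip’s download cache (lowers download counts)

    • Internal or unofficial mirrors (can both raise and lower download counts)

    • Packages not hosted on PyPI (for comparison’s sake)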

    • Unofficial scripts or attempts at download count inflation (raises download counts)

    • Known historical data quality issues (lowers download counts)

  • Not particularly useful: Just because a project has been downloaded a lot doesn’t mean it’s good; similarly, just because a project hasn’t been downloaded a lot doesn’t mean it’s bad!

In short, because its value is low for various reasons, and the tradeoffs required to make it work are high, it has not been an effective use of limited resources.

Public dataset

As an alternative, the Linehaul project streams download logs from PyPI to Google BigQuery [2], where they are stored as a public dataset.

Getting set up

In order to use Google BigQuery to query the public PyPI download statistics dataset, you’ll need a Google account and to enable the BigQuery API on a Google Cloud Platform project. You can run up to 1TB of queries per month using the BigQuery free tier without a credit card.

For more detailed instructions on how to get started with BigQuery, check out the BigQuery quickstart guide (https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui).

Data schema

Linehaul writes an entry in a bigquery-public-data.pypi.file_downloads table for each download. The table contains information about what file was downloaded and how it was downloaded. Some useful columns from the table schema include:

Column                  Description      Examples
----------------------  ---------------  -----------------------
timestamp               Date and time    2020-03-09 00:33:03 UTC
file.project            Project name     pipenv, nose
file.version            Package version  0.1.6, 1.4.2
details.installer.name  Installer        pip, bandersnatch
details.python          Python version   2.7.12, 3.6.4

Useful queries

Run queries in the BigQuery web UI by clicking the “Compose query” button.

Note that the rows are stored in a partitioned table, which helps limit the cost of queries. These example queries analyze downloads from recent history by filtering on the timestamp column.
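
Because cost depends on how much data a query scans, it can help to check the size of a query before running it. The following is a minimal sketch, assuming the google-cloud-bigquery client library is installed and credentials are configured. It uses a dry run to read the estimated number of bytes processed without executing the query. The helper names are hypothetical, and reading the "1TB" free tier as 2**40 bytes is an assumption made for illustration.

```python
FREE_TIER_BYTES = 2**40  # Assumption: the monthly "1TB" free tier read as 1 TiB.


def fraction_of_free_tier(bytes_processed: int) -> float:
    """Return the share of the monthly free tier that one query would use."""
    return bytes_processed / FREE_TIER_BYTES


def estimate_bytes_processed(sql: str) -> int:
    """Dry-run a query and return the bytes BigQuery estimates it would scan."""
    # Imported here so the pure helper above works without the library.
    from google.cloud import bigquery

    client = bigquery.Client()
    # A dry run validates the query and reports its size without running it.
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    query_job = client.query(sql, job_config=job_config)
    return query_job.total_bytes_processed
```

Calling fraction_of_free_tier(estimate_bytes_processed(sql)) then gives a rough idea of how much of the monthly allowance a single query would consume.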

Counting package downloads

The following query counts the total number of downloads for the project “pytest”.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()

num_downloads

26190085

To count only downloads from pip, filter on the details.installer.name column.

#standardSQL
SELECT COUNT(*) AS num_downloads
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project = 'pytest'
  AND details.installer.name = 'pip'
  -- Only query the last 30 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()

num_downloads

24334215
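
The two queries above differ only in the optional installer filter, so they can be generated from one template. The following is a minimal sketch in plain Python; the function name build_count_query and the character check on names are assumptions made for illustration, not part of any library. For untrusted input, the query parameters offered by the BigQuery client libraries would be a safer choice than string formatting.

```python
import re

# Assumption: project and installer names contain only letters, digits,
# ".", "_" and "-". Anything else is rejected before it reaches the SQL text.
_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def build_count_query(project, days=30, installer=None):
    """Build a download-count query for one project over the last `days` days."""
    names = [project] if installer is None else [project, installer]
    for name in names:
        if not _NAME_RE.fullmatch(name):
            raise ValueError(f"unexpected characters in name: {name!r}")
    if isinstance(days, bool) or not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")

    lines = [
        "SELECT COUNT(*) AS num_downloads",
        "FROM `bigquery-public-data.pypi.file_downloads`",
        f"WHERE file.project = '{project}'",
    ]
    if installer is not None:
        lines.append(f"  AND details.installer.name = '{installer}'")
    lines += [
        "  AND DATE(timestamp)",
        f"    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)",
        "    AND CURRENT_DATE()",
    ]
    return "\n".join(lines)


print(build_count_query("pytest", installer="pip"))
```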

Package downloads over time

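To group downloads by month, truncate the download date with the DATE_TRUNC function. Filtering on the timestamp column as well reduces the corresponding costs.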

#standardSQL
SELECT
  COUNT(*) AS num_downloads,
  DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
  file.project = 'pytest'
  -- Only query the last 6 months of history
  AND DATE(timestamp)
    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
    AND CURRENT_DATE()
GROUP BY `month`
ORDER BY `month` DESC

num_downloads  month
-------------  ----------
1956741        2018-01-01
2344692        2017-12-01
1730398        2017-11-01
2047310        2017-10-01
1744443        2017-09-01
1916952        2017-08-01
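
Monthly counts are often easier to read as relative changes. The following is a minimal sketch that computes month-over-month changes from the sample result rows shown above; the function name month_over_month is hypothetical. Since the query's date range ends at CURRENT_DATE(), the newest month usually covers only part of a month, so its change should be read with care.

```python
from datetime import date

# Sample rows copied from the result table above: (month, num_downloads).
rows = [
    (date(2018, 1, 1), 1956741),
    (date(2017, 12, 1), 2344692),
    (date(2017, 11, 1), 1730398),
    (date(2017, 10, 1), 2047310),
    (date(2017, 9, 1), 1744443),
    (date(2017, 8, 1), 1916952),
]


def month_over_month(rows):
    """Return (month, relative change vs. the previous month), oldest first."""
    ordered = sorted(rows)
    changes = []
    for (_, previous), (month, count) in zip(ordered, ordered[1:]):
        changes.append((month, (count - previous) / previous))
    return changes


for month, change in month_over_month(rows):
    print(f"{month:%Y-%m}: {change:+.1%}")
```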

Python versions over time

Extract the Python version from the details.python column. Warning: this query processes over 500 GB of data.

#standardSQL
SELECT
  REGEXP_EXTRACT(details.python, r"[0-9]+\.[0-9]+") AS python_version,
  COUNT(*) AS num_downloads,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
  -- Only query the last 6 months of history
  DATE(timestamp)
    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
    AND CURRENT_DATE()
GROUP BY `python_version`
ORDER BY `num_downloads` DESC

python_version  num_downloads
--------------  -------------
3.7             18051328726
3.6             9635067203
3.8             7781904681
2.7             6381252241
null            2026630299
3.5             1894153540
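
To see what the REGEXP_EXTRACT call in the query does, the same pattern can be tried locally. The following is a minimal sketch using Python's re module, which behaves the same way as BigQuery's regular expressions for this simple pattern; the function name major_minor is hypothetical. It also illustrates why a row without a version can appear in the results: when the column is empty or does not match, the extraction yields no value.

```python
import re

# Same pattern as in the query's REGEXP_EXTRACT call.
VERSION_RE = re.compile(r"[0-9]+\.[0-9]+")


def major_minor(python_version):
    """Return "major.minor", or None when the value is missing or has no match."""
    if python_version is None:
        return None
    match = VERSION_RE.search(python_version)
    return match.group(0) if match else None


# The first two inputs are the example values from the schema table.
for raw in ["2.7.12", "3.6.4", None]:
    print(repr(raw), "->", repr(major_minor(raw)))
```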

Caveats

In addition to the caveats listed in the background above, Linehaul suffered from a bug which caused it to significantly under-report download statistics prior to July 26, 2018. Downloads before this date are proportionally accurate (e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are lower than actual by an order of magnitude.
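
Since proportions are more robust than totals, it can be useful to convert counts into shares. The following is a minimal sketch that computes the Python 2 versus Python 3 split from the sample rows of the Python versions result table; the function name major_version_shares is hypothetical, and leaving out the row without a version is an assumption about how to treat it.

```python
# Sample rows copied from the Python versions result table.
# None marks the row for which no version was extracted.
counts = {
    "3.7": 18051328726,
    "3.6": 9635067203,
    "3.8": 7781904681,
    "2.7": 6381252241,
    None: 2026630299,
    "3.5": 1894153540,
}


def major_version_shares(counts):
    """Return each major version's share of downloads with a known version."""
    totals = {}
    for version, num_downloads in counts.items():
        if version is None:
            continue  # Skip downloads with no extracted Python version.
        major = version.split(".")[0]
        totals[major] = totals.get(major, 0) + num_downloads
    known = sum(totals.values())
    return {major: n / known for major, n in totals.items()}


shares = major_version_shares(counts)
for major, share in sorted(shares.items()):
    print(f"Python {major}: {share:.1%}")
```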

Additional tools

Besides using the BigQuery console, there are some additional tools which may be useful when analyzing download statistics.

google-cloud-bigquery

You can also access the public PyPI download statistics dataset programmatically via the BigQuery API and the google-cloud-bigquery project, the official Python client library for BigQuery.

from google.cloud import bigquery

# Note: depending on where this code is being run, you may require
# additional authentication. See:
# https://cloud.google.com/bigquery/docs/authentication/
client = bigquery.Client()

query_job = client.query("""
SELECT COUNT(*) AS num_downloads
FROM `bigquery-public-data.pypi.file_downloads`
WHERE file.project = 'pytest'
  -- Only query the last 30 days of history
  AND DATE(timestamp)
    BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
    AND CURRENT_DATE()""")

results = query_job.result()  # Waits for job to complete.
for row in results:
    print("{} downloads".format(row.num_downloads))

pypinfo

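pypinfo is a command-line tool which provides access to the dataset and can generate several useful queries. For example, you can query the total number of downloads for a package with the command pypinfo package_name.

Install pypinfo using pip.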

python -m pip install pypinfo

Usage:

$ pypinfo requests
Served from cache: False
Data processed: 6.87 GiB
Data billed: 6.87 GiB
Estimated cost: $0.04

| download_count |
| -------------- |
|      9,316,415 |

pandas-gbq

The pandas-gbq project allows for accessing query results via Pandas.
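
As a minimal sketch, assuming the pandas-gbq package is installed and credentials are configured, the monthly downloads query from earlier in this section could be loaded into a DataFrame as follows. The function name monthly_downloads is hypothetical, and the project_id argument stands for whichever Google Cloud project should be billed for the query.

```python
MONTHLY_DOWNLOADS_SQL = """
SELECT
  COUNT(*) AS num_downloads,
  DATE_TRUNC(DATE(timestamp), MONTH) AS `month`
FROM `bigquery-public-data.pypi.file_downloads`
WHERE
  file.project = 'pytest'
  -- Only query the last 6 months of history
  AND DATE(timestamp)
    BETWEEN DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH), MONTH)
    AND CURRENT_DATE()
GROUP BY `month`
ORDER BY `month` DESC
"""


def monthly_downloads(project_id):
    """Run the monthly downloads query and return the result as a DataFrame."""
    # Imported lazily; requires the pandas-gbq package.
    import pandas_gbq

    # project_id is the Google Cloud project that is billed for the query.
    return pandas_gbq.read_gbq(
        MONTHLY_DOWNLOADS_SQL, project_id=project_id, dialect="standard"
    )
```

A call such as monthly_downloads("my-project"), where "my-project" is a placeholder project ID, would return one row per month with the num_downloads and month columns.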

References

[1] PyPI Download Counts deprecation email

[2] PyPI BigQuery dataset announcement email