Downloading a Kaggle dataset using Python

7kjnsjlb  posted on 2024-01-05  in  Python

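I am trying to download a Kaggle dataset using Python. However, I ran into a problem with the requests library: the downloaded output .csv file is not the data but a broken HTML file.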

import requests

# The direct link to the Kaggle data set
data_url = 'https://www.kaggle.com/crawford/gene-expression/downloads/actual.csv'

# The local path where the data set is saved.
local_filename = "actsual.csv"

# Kaggle Username and Password
kaggle_info = {'UserName': "myUsername", 'Password': "myPassword"}

# Attempts to download the CSV file. Gets rejected because we are not logged in.
r = requests.get(data_url)

# Login to Kaggle and retrieve the data.
r = requests.post(r.url, data = kaggle_info)

# Writes the data to a local file one chunk at a time.
f = open(local_filename, 'wb')
for chunk in r.iter_content(chunk_size = 512 * 1024): # Reads 512KB at a time into memory

    if chunk: # filter out keep-alive new chunks
        f.write(chunk)
f.close()

Output file:

<!DOCTYPE html>
<html>
<head>
    <title>Gene expression dataset (Golub et al.) | Kaggle</title>
    <meta charset="utf-8" />
    <meta name="robots" content="index, follow"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0">    <meta name="theme-color" content="#008ABC" />
    <link rel="dns-prefetch" href="https://www.google-analytics.com" /><link rel="dns-prefetch" href="https://stats.g.doubleclick.net" /><link rel="dns-prefetch" href="https://js.intercomcdn.com" /><link rel="preload" href="https://az416426.vo.msecnd.net/scripts/a/ai.0.js" as=script /><link rel="dns-prefetch" href="https://kaggle2.blob.core.windows.net" />
    <link href="/content/v/d420a040e581/kaggle/favicon.ico" rel="shortcut icon" type="image/x-icon" />
    <link rel="manifest" href="/static/json/manifest.json">
    <link href="//fonts.googleapis.com/css?family=Open+Sans:400,300,300italic,400italic,600,600italic,700,700italic" rel='stylesheet' type='text/css'>
                    <link rel="stylesheet" type="text/css" href="/static/assets/vendor.css?v=72f4ef2ebe4f"/>
        <link rel="stylesheet" type="text/css" href="/static/assets/app.css?v=d997fa977b65"/>
        <script>

            (function () {
                var originalError = window.onerror;

                window.onerror = function (message, url, lineNumber, columnNumber, error) {
                    var handled = originalError && originalError(message, url, lineNumber, columnNumber, error);
                    var blockedByCors = message && message.toLowerCase().indexOf("script error") >= 0;
                    return handled || blockedByCors;
                };
            })();
        </script>
    <script>
        var appInsights=window.appInsights||function(config){
        function i(config){t[config]=function(){var i=arguments;t.queue.push(function(){t[config].apply(t,i)})}}var t={config:config},u=document,e=window,o="script",s="AuthenticatedUserContext",h="start",c="stop",l="Track",a=l+"Event",v=l+"Page",y=u.createElement(o),r,f;y.src=config.url||"https://az416426.vo.msecnd.net/scripts/a/ai.0.js";u.getElementsByTagName(o)[0].parentNode.appendChild(y);try{t.cookie=u.cookie}catch(p){}for(t.queue=[],t.version="1.0",r=["Event","Exception","Metric","PageView","Trace","Dependency"];r.length;)i("track"+r.pop());return i("set"+s),i("clear"+s),i(h+a),i(c+a),i(h+v),i(c+v),i("flush"),config.disableExceptionTracking||(r="onerror",i("_"+r),f=e[r],e[r]=function(config,i,u,e,o){var s=f&&f(config,i,u,e,o);return s!==!0&&t["_"+r](config,i,u,e,o),s}),t
        }({
            instrumentationKey:"5b3d6014-f021-4304-8366-3cf961d5b90f",
            disableAjaxTracking: true
        });
        window.appInsights=appInsights;
        appInsights.trackPageView();
    </script>
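
The output above is a Kaggle web page rather than the CSV, which suggests the server answered the unauthenticated request with HTML. As a minimal sketch (the helper name `looks_like_html` and the sample payloads are hypothetical, not part of requests or Kaggle), a script could inspect the first bytes of the response body before writing it to disk:

```python
def looks_like_html(body: bytes) -> bool:
    """Return True if the payload appears to start with an HTML document."""
    head = body[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")


# Made-up sample payloads, for illustration only
html_page = b"<!DOCTYPE html>\n<html>\n<head><title>Kaggle</title></head>"
csv_data = b"col_a,col_b\n1,2\n"

print(looks_like_html(html_page))  # True
print(looks_like_html(csv_data))   # False
```

In the question's script, this check could be applied to `r.content` before opening the local file, so that a login page is never saved under a .csv name.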


hpcdzsge1#

Basically, if you want to use the Kaggle Python API (the solution provided by @minh-triet is for the command line, not for Python), you have to do the following:

import kaggle

kaggle.api.authenticate()

kaggle.api.dataset_download_files('The_name_of_the_dataset', path='the_path_you_want_to_download_the_files_to', unzip=True)

I hope this helps.


moiiocjp2#

The Kaggle API key and username are available on your Kaggle profile page, and the dataset download link is available on the dataset's details page on Kaggle.

# Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "xxxx"
os.environ['KAGGLE_KEY'] = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# The leading "!" runs a shell command from a Jupyter/Colab notebook cell
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition
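
Instead of pasting the key into the notebook, the two variables can be filled from the downloaded kaggle.json. This is a sketch under the assumption that the file has the `username` and `key` fields shown elsewhere in this thread; `load_kaggle_credentials` is a hypothetical helper name and the demo values are placeholders.

```python
import json
import os
import tempfile
from pathlib import Path


def load_kaggle_credentials(json_path):
    """Read kaggle.json and export its values as KAGGLE_USERNAME / KAGGLE_KEY."""
    creds = json.loads(Path(json_path).read_text(encoding="utf-8"))
    os.environ["KAGGLE_USERNAME"] = creds["username"]
    os.environ["KAGGLE_KEY"] = creds["key"]
    return creds["username"]


# Demonstration with a throwaway file holding placeholder (not real) credentials
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "kaggle.json"
    sample.write_text(json.dumps({"username": "xxxx", "key": "x" * 32}), encoding="utf-8")
    print(load_kaggle_credentials(sample))  # xxxx
```

The variables should be set before the kaggle command or package is used, since that is when the credentials are looked up.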


bwitn5fc3#

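I recommend looking at the Kaggle API instead of writing your own code. As of the latest version, an example command for downloading a dataset is kaggle datasets download -d zillow/zecon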
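
If the command has to be launched from a plain Python script rather than a shell, one option is `subprocess`. A minimal sketch, assuming the `kaggle` executable is on PATH; `build_download_command` and `run_download` are hypothetical helper names, while `-d`, `-p` and `--unzip` are options of the CLI's `datasets download` command:

```python
import shlex
import subprocess


def build_download_command(dataset, dest=None, unzip=False):
    """Build the argument list for 'kaggle datasets download'."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset]
    if dest is not None:
        cmd += ["-p", dest]       # target folder
    if unzip:
        cmd.append("--unzip")     # extract the archive after download
    return cmd


def run_download(dataset, dest=None, unzip=False):
    """Run the CLI; raises CalledProcessError if kaggle exits with an error."""
    subprocess.run(build_download_command(dataset, dest, unzip), check=True)


cmd = build_download_command("zillow/zecon", dest="data", unzip=True)
print(shlex.join(cmd))  # kaggle datasets download -d zillow/zecon -p data --unzip
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.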


jhkqcmku4#

Before anything else:

pip install kaggle

For datasets:

import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle datasets download -d zynicide/wine-reviews


For competitions:

import os
os.environ['KAGGLE_USERNAME'] = "uname" # username from the json file
os.environ['KAGGLE_KEY'] = "kaggle_key" # key from the json file
!kaggle competitions download -c dogs-vs-cats-redux-kernels-edition


A while ago, I posted another similar answer.


yquaqz185#

A complete version of the Download_Kaggle_Dataset_To_Colab example, which worked for me under Windows:

#Step1
#Input:
from google.colab import files
files.upload()  #this will prompt you to upload kaggle.json. Download the API token file from Kaggle, save it to a folder on your PC, and choose it here

#Output Sample:
#kaggle.json
#kaggle.json(application/json) - 69 bytes, last modified: 29.06.2021 - 100% done
#Saving kaggle.json to kaggle.json
#{'kaggle.json': 
#b'{"username":"sergeysukhov7","key":"23d4d4abdf3bee8ba88e653cec******"}'}

#Step2
#Input:
!pip install -q kaggle
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json  # restrict the file to owner read/write (in Colab, ~ is /root)

#Output:
#kaggle.json

#Step3
#Input:
#Set the environment variables
import os
os.environ['KAGGLE_USERNAME'] = "sergeysukhov7"  #enter your Kaggle username manually
os.environ['KAGGLE_KEY'] = "23d4d4abdf3bee8ba88e653cec5*****"  #enter your Kaggle key manually

#Step4
#Optional: download a dataset to the default folder (content/zecon.zip)
#!kaggle datasets download -d zillow/zecon

#Find the dataset link on Kaggle, for example https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools,
#and take the owner/dataset part of it: willkoehrsen/home-credit-default-risk-feature-tools
#The archive is saved to the Colab folder given with -p:
#/content/gdrive/My Drive/kaggle/credit/home-credit-default-risk-feature-tools.zip
!kaggle datasets download -d willkoehrsen/home-credit-default-risk-feature-tools -p /content/gdrive/My\ Drive/kaggle/credit

#Output
#Downloading home-credit-default-risk-feature-tools.zip to /content/gdrive/My Drive/kaggle/credit
#100% 3.63G/3.63G [01:31<00:00, 27.6MB/s]
#100% 3.63G/3.63G [01:31<00:00, 42.7MB/s]
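
Step 4 picks the `owner/dataset` part out of the dataset link by hand. A minimal sketch of doing the same in Python follows. The helper name `dataset_handle_from_url` is hypothetical, and the two URL shapes it accepts (with and without a leading `/datasets/` segment) are an assumption about how Kaggle dataset pages are addressed. Competition links are rejected.

```python
from urllib.parse import urlparse


def dataset_handle_from_url(url):
    """Extract 'owner/dataset-slug' from a Kaggle dataset page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] == "datasets":
        parts = parts[1:]  # drop the optional /datasets/ prefix
    if len(parts) < 2 or parts[0] in ("competitions", "c"):
        raise ValueError(f"not a dataset URL: {url!r}")
    return f"{parts[0]}/{parts[1]}"


url = "https://www.kaggle.com/willkoehrsen/home-credit-default-risk-feature-tools"
print(dataset_handle_from_url(url))
# willkoehrsen/home-credit-default-risk-feature-tools
```

The returned string is what the `-d` option of `kaggle datasets download` expects.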


zlwx9yxi6#

I really struggled with the Kaggle API, so I use opendatasets instead. It is important to put kaggle.json in the same folder as the notebook.

pip install opendatasets

import opendatasets as od

od.download("https://www.kaggle.com/competitions/tlvmc-parkinsons-freezing-gait-prediction/data","/mypath/goes/here")

Documentation
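
Since the download depends on kaggle.json sitting next to the notebook, it can help to check for the file first. A minimal sketch, assuming the file holds the `username` and `key` fields; `find_local_credentials` is a hypothetical helper and not part of opendatasets:

```python
import json
import tempfile
from pathlib import Path


def find_local_credentials(folder="."):
    """Return the path to kaggle.json in folder if it has both fields, else None."""
    path = Path(folder) / "kaggle.json"
    if not path.is_file():
        return None
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return path if {"username", "key"} <= data.keys() else None


# Demonstration in a temporary folder with placeholder credentials
with tempfile.TemporaryDirectory() as tmp:
    before = find_local_credentials(tmp)
    (Path(tmp) / "kaggle.json").write_text('{"username": "xxxx", "key": "yyyy"}', encoding="utf-8")
    after = find_local_credentials(tmp)

print(before is None)     # True
print(after is not None)  # True
```

Calling `find_local_credentials()` with no argument checks the current working directory, which for a notebook is normally the folder it was started in.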


b1payxdu7#

I tested some of the solutions provided here, but some of them are outdated as of 12/22/2023. Below is what I implemented and tested under JupyterLab with Python 3.12.1:

import os

def downloadInputData(competitionName, dataDir='input'):
  """Download competition files from Kaggle, assuming you previously downloaded the
  kaggle.json file and put it in the location indicated here:
  https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials.
  Unzips the files and places them in the dataDir location. If the folder already
  contains data, it is deleted first. Once the archive is unzipped,
  the zip file is removed."""
  import importlib.util
  if importlib.util.find_spec('kaggle') is None:
    ! pip install kaggle --quiet
  import kaggle
  kaggle.api.authenticate() # raise an error if the kaggle.json is not in the expected location

  # download and unzip competition data
  ! rm -rf {dataDir}  # removing data files if they exist
  ! kaggle competitions download -q {competitionName} # -q for quiet download
  ! mkdir -p {dataDir}
  zipFile = competitionName + '.zip'
  if not os.path.exists(zipFile):
    print(f"Error: {zipFile} not found.")
  else:
    # -q silent option (no output), concatenate rm to remove the zip file
    ! unzip -q {zipFile} -d {dataDir} && rm {zipFile}

def getKeyFileFromColab():
  """Get the kaggle.json file from Colab and put it in the expected location, as specified by
  https://github.com/Kaggle/kaggle-api?tab=readme-ov-file#api-credentials
  It assumes the kaggle.json file is in Google Drive
  at the Colab Notebooks location. This function can be invoked only from Colab,
  because that is where the google.colab package exists."""
  from google.colab import drive # declaring it here to avoid ModuleNotFoundError in Kaggle
  # We need to escape the space ('\\ ')
  gdrive_kaggleCreds_file = '/content/drive/My\\ Drive/Colab\\ Notebooks/kaggle.json'
  kaggleDir = '~/.kaggle' # Destination folder
  kaggle_file = kaggleDir + '/' + 'kaggle.json' # Destination file
  drive.mount("/content/drive", force_remount=False)
  ! mkdir -p {kaggleDir} # -p option doesn't raise an error if the folder exists
  ! cp {gdrive_kaggleCreds_file} {kaggleDir}
  ! chmod 600 {kaggle_file} # user read/write
  drive.flush_and_unmount()

# Testing
isLocal = True # Using a local notebook or Kaggle
isColab = True # Control if the local environment is Colab
loadKeyFile = False # Control to download the Kaggle key file
competitionName = 'titanic'
dataDir = 'input/' if isLocal==True else '/kaggle/input/' + competitionName + '/'
workDir = 'working/' if isLocal==True else '/kaggle/working/'

if isLocal: # Creating the working folder when working locally
  ! mkdir -p {workDir}
  # Getting kaggle.json file from Colab and putting it in the correct location
  if isColab and loadKeyFile: getKeyFileFromColab()
  # downloading competition files from Kaggle
  downloadInputData(competitionName=competitionName, dataDir=dataDir)

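The function getKeyFileFromColab simply fetches the kaggle.json file stored in Colab and puts it in the expected location. If you are not using Colab, you cannot call this function to fetch kaggle.json; you need to do it manually and place the file in the expected folder: ~/.kaggle, i.e. $HOME/.kaggle. The file only needs to be fetched once per active session, which is why it is a separate function. You can control this step through the loadKeyFile control variable.
Once the setup is correct, we can download the competition files through the downloadInputData function.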
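
The `rm`, `mkdir` and `unzip` shell calls in downloadInputData only work inside a notebook on a system that has those tools. A minimal sketch of the same extraction step using only the standard library follows. `extract_and_remove` is a hypothetical helper name, and the demo archive is built on the fly in place of a real competition download.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, data_dir):
    """Unzip zip_path into a fresh data_dir, then delete the archive."""
    zip_path, data_dir = Path(zip_path), Path(data_dir)
    if not zip_path.exists():
        raise FileNotFoundError(zip_path)
    shutil.rmtree(data_dir, ignore_errors=True)  # like rm -rf
    data_dir.mkdir(parents=True)                 # like mkdir -p
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(data_dir)             # like unzip -d
    zip_path.unlink()                            # remove the zip file
    return sorted(p.name for p in data_dir.iterdir())


# Demonstration with a small archive created locally
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "titanic.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("train.csv", "id,label\n1,0\n")
        archive.writestr("test.csv", "id\n2\n")
    extracted = extract_and_remove(archive_path, Path(tmp) / "input")
    zip_still_there = archive_path.exists()

print(extracted)  # ['test.csv', 'train.csv']
```

Unlike the notebook version, this sketch clears the target folder only after confirming that the archive exists, so a failed download does not wipe previously extracted data.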


o2g1uqev8#

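To make it easier for the next person, I combined the fantastic answer from CaitLAN Jenner with a small piece of code that puts the raw CSV data into a Pandas DataFrame, assuming row 0 holds the column names. I used it to download the Pima Diabetes dataset from Kaggle, and it worked smoothly.
I am sure there are more elegant ways to do this, but it was good enough for a class I teach, it is easy to explain, and it gets you to the analysis with minimal fuss.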

import pandas as pd
import requests
import csv

payload = {
    '__RequestVerificationToken': '',
    'username': 'username',
    'password': 'password',
    'rememberme': 'false'
}

loginURL = 'https://www.kaggle.com/account/login'
dataURL = "https://www.kaggle.com/uciml/pima-indians-diabetes-database/downloads/diabetes.csv"

with requests.Session() as c:
    response = c.get(loginURL).text
    AFToken = response[response.index('antiForgeryToken')+19:response.index('isAnonymous: ')-12]
    #print("AntiForgeryToken={}".format(AFToken))
    payload['__RequestVerificationToken']=AFToken
    c.post(loginURL + "?isModal=true&returnUrl=/", data=payload)
    download = c.get(dataURL)
    decoded_content = download.content.decode('utf-8')
    cr = csv.reader(decoded_content.splitlines(), delimiter=',')
    my_list = list(cr)
    #for row in my_list:
    #    print(row)

df = pd.DataFrame(my_list)
header = df.iloc[0]
df = df[1:]
# set_axis returns a new DataFrame; the inplace argument was removed in pandas 2.0
diab = df.set_axis(header, axis='columns')

# to make sure it worked, uncomment this next line:
# diab


iqjalb3h9#

Reference: https://github.com/Kaggle/kaggle-api

**Step 1:** Install Kaggle

pip install kaggle # Windows
pip install --user kaggle # Mac/Linux

**Step 2:**

Update your credentials so that kaggle can authenticate using the token generated on Kaggle and stored in .kaggle/kaggle.json. Ref: https://medium.com/@ankushchoubey/how-to-download-dataset-from-kaggle-7f700d7f9198

**Step 3:** Now that kaggle is installed, run kaggle competitions download ..

Run ~/.local/bin/kaggle competitions download .. to avoid the "Command Kaggle Not Found" error.
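
The "not found" error usually means the folder used by `pip install --user` is not on PATH. A minimal sketch of locating the executable from Python follows. It assumes a Mac/Linux layout with `~/.local/bin` as the fallback folder, and `find_kaggle_cli` is a hypothetical helper name. The demo uses a fake executable in a temporary folder.

```python
import os
import shutil
import tempfile
from pathlib import Path


def find_kaggle_cli(search_path=None, user_bin=None):
    """Return the path to the kaggle executable, or None if it cannot be found."""
    found = shutil.which("kaggle", path=search_path)  # PATH is used when search_path is None
    if found:
        return found
    user_bin = Path(user_bin) if user_bin is not None else Path.home() / ".local" / "bin"
    candidate = user_bin / "kaggle"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate)
    return None


# Demonstration with a fake executable instead of a real installation
with tempfile.TemporaryDirectory() as tmp:
    empty_dir = Path(tmp) / "empty"
    fake_bin = Path(tmp) / "bin"
    empty_dir.mkdir()
    fake_bin.mkdir()
    fake_cli = fake_bin / "kaggle"
    fake_cli.write_text("#!/bin/sh\n")
    fake_cli.chmod(0o755)
    located = find_kaggle_cli(search_path=str(empty_dir), user_bin=fake_bin)
    missing = find_kaggle_cli(search_path=str(empty_dir), user_bin=empty_dir)

print(Path(located).name)  # kaggle
print(missing)             # None
```

Called with no arguments, the helper checks PATH first and then `~/.local/bin`, which mirrors the manual workaround described above.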
