I'm trying to fetch the content of some pages from The New York Times.
import requests

# Spoof a desktop Firefox user agent so the request looks like a browser
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'}
url = 'https://www.nytimes.com'
response = requests.get(url, headers=headers)
print(response)
This gets me <Response [200]>. But if I change url to any specific article, e.g. https://www.nytimes.com/2022/11/19/sports/soccer/world-cup-qatar-2022.html, then it gives me <Response [403]>:
import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'}
url = 'https://www.nytimes.com/2022/11/19/sports/soccer/world-cup-qatar-2022.html'
response = requests.get(url, headers=headers)
print(response)
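In case it helps, here is a minimal sketch of how I compared the two responses side by side (same headers dict as above). Printing the status code and a couple of response headers is just my guess at where to look; the specific header names below are examples, not anything NYT documents:

import requests

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:105.0) Gecko/20100101 Firefox/105.0'}

urls = [
    'https://www.nytimes.com',
    'https://www.nytimes.com/2022/11/19/sports/soccer/world-cup-qatar-2022.html',
]

for url in urls:
    response = requests.get(url, headers=headers)
    # The homepage comes back 200, the article 403
    print(url, '->', response.status_code)
    # Response headers can hint at a bot-detection layer in front of the
    # site (checking these two is an assumption on my part)
    print('  Server:', response.headers.get('Server'))
    print('  Set-Cookie:', response.headers.get('Set-Cookie'))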
Why is this happening, and how can I make it work?
I also looked at robots.txt and I can't see any obvious problem there.
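For completeness, this is roughly how I double-checked robots.txt programmatically with the standard library's urllib.robotparser (passing '*' to can_fetch to match the generic rules is my assumption about what applies here):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.nytimes.com/robots.txt')
rp.read()

article = 'https://www.nytimes.com/2022/11/19/sports/soccer/world-cup-qatar-2022.html'
# can_fetch() returns True when the rules allow this agent to fetch the URL.
# Note: if fetching robots.txt itself returns 401/403, the parser treats
# everything as disallowed, so a False here can be misleading.
print(rp.can_fetch('*', article))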