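A faint glimmer in the distance, walking alongside me
Long time no see, everyone! Wow, Spring Festival really flew by! Happy New Year to you all!
There is a path up the mountain of books, and diligence is the trail; the sea of learning has no shore, and hard work is the boat. Time to get back to happily studying!
Today Xiao Yedou is sharing a hands-on post with you all. Let's take off!
JS reverse engineering: scraping NetEase Cloud Music comments and running sentiment analysis with snownlp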
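Part 1: Reverse engineering the NetEase Cloud Music parameters to fetch comments
NetEase Cloud Music PC URL:
https://music.163.com/#/song?id=1817702136
The comments we want to scrape are shown in the figure below:
As usual, inspect the page and find the request URL that carries the comment data!
It is easy to find under XHR; see the screenshot below:
Requesting this URL directly does not return the comments above, because the request carries two dynamically encrypted parameters: params and encSecKey.
To request the URL that actually holds the comments, we need to build a form from these two parameters and send a POST request.
So our job now is to work out how these two parameters are generated, obtain them, build our own form, and send the POST request to get the response!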
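Approach 1: analyse the page source, find the functions that generate the parameters, and use the page's own code to obtain both of them. (This usually means analysing JavaScript, commonly called JS reverse engineering.)
Approach 2: understand how the page's JS generates the two parameters, then re-implement that logic in Python and build the two parameters ourselves.
First, among the pile of requests, find the file that contains the params and encSecKey parameters.
Press Ctrl + F and search for params to find the file the arrow points to!
Click through to the JS file in Sources and pretty-print it; the formatted result is shown in Figure 2.
Figure 1:
Figure 2:
After formatting, search for params to locate it, then analyse step by step how it is generated!
OK, now for the key part: reverse engineering how these two parameters are encrypted!
Step 1: find the JS code that generates the two parameters, shown below:
Pull the JS code out and take a look: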
e5j.data = j5o.cr6l({
    params: bWv4z.encText,
    encSecKey: bWv4z.encSecKey
})
}
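It looks like params and encSecKey are taken from the encText and encSecKey properties of the bWv4z object.
Step 2: next, let's find out how the bWv4z object is created.
var bWv4z = window.asrsea(JSON.stringify(i5n), bsK6E(["流泪", "强"]), bsK6E(XR1x.md), bsK6E(["爱心", "女孩", "惊恐", "大笑"]));
bWv4z is produced by calling window.asrsea with four arguments. At this point we could either analyse the four arguments first or find out how window.asrsea itself is defined; here we do the latter first.
Step 3: how is window.asrsea defined?
From the figure above we can see that it is assigned from d. At first Xiao Yedou had no idea what d was; a search showed that d is a function, as in the figure below:
So now we know: window.asrsea is simply the function d assigned to another name (Xiao Yedou does not know JS well, so this is his own understanding), which means the four arguments passed to window.asrsea are exactly the four arguments d expects!
Let's set a breakpoint, press F5 to refresh the page, and look at the four arguments passed into d!
The four arguments of d are shown below:
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"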
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
Now look at the window.asrsea call again; its four arguments are the d, e, f and g above:
var bWv4z = window.asrsea(JSON.stringify(i5n), bsK6E(["流泪", "强"]), bsK6E(XR1x.md), bsK6E(["爱心", "女孩", "惊恐", "大笑"]));
Take the first argument, JSON.stringify(i5n), which corresponds to d: it serialises i5n into a JSON string. Let's set a breakpoint and see what i5n ends up being!
i5n = {csrf_token: "d4339865ec133c9a7d77a25389bc0265"}
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
After switching to a different song, Xiao Yedou found that all four arguments stay the same:
url:
https://music.163.com/#/song?id=1820887593
d: "{"csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
Setting the breakpoint again and selecting page 2 revealed something new!
Note: to break on page 2, set the breakpoint first. Pressing F5 jumps back to page 1; only then select page 2, and the parameters will be loaded!
d: "{"rid":"R_SO_4_1817702136","threadId":"R_SO_4_1817702136","pageNo":"2","pageSize":"20","cursor":"1613900247044","offset":"40","orderType":"1","csrf_token":"d4339865ec133c9a7d77a25389bc0265"}"
e: "010001"
f: "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7"
g: "0CoJUm6Qyw8W8jud"
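Let's see what the fields inside d mean:
"rid": "R_SO_4_1817702136"  the trailing number is the id at the end of the page URL (changes with the song id)
"threadId": "R_SO_4_1817702136"  same as above (changes with the song id)
"pageNo": "2"  page number (variable)
"pageSize": "20"  number of comments per page (constant)
"cursor": "1613900247044"  looks like a 13-digit millisecond timestamp (variable)
"offset": "40"  offset (page number * 20) (variable)
"orderType": "1"  presumably some ordering type (constant)
"csrf_token": "d4339865ec133c9a7d77a25389bc0265"  also constant here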
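To make the field list concrete, here is a minimal Python sketch that builds the same JSON string for an arbitrary page. The helper name build_payload and its parameters are made up for illustration; the offset rule simply mirrors the captured page-2 request (pageNo 2, offset 40), and the compact separators are there to imitate what JavaScript's JSON.stringify produces.

```python
import json


def build_payload(song_id, page_no, cursor, csrf_token="", page_size=20):
    """Build the JSON string that is passed to d() as its first argument.

    Hypothetical helper: the field names come from the captured request,
    everything else is an illustrative assumption.
    """
    thread = f"R_SO_4_{song_id}"
    payload = {
        "rid": thread,
        "threadId": thread,
        "pageNo": str(page_no),
        "pageSize": str(page_size),
        "cursor": str(cursor),
        # Mirrors the captured request: pageNo 2 was sent with offset 40
        "offset": str(page_no * page_size),
        "orderType": "1",
        "csrf_token": csrf_token,
    }
    # Compact separators imitate the output of JSON.stringify (no spaces)
    return json.dumps(payload, separators=(",", ":"))
```

Calling build_payload(1817702136, 2, 1613900247044, "d4339865ec133c9a7d77a25389bc0265") gives back the page-2 string captured at the breakpoint.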
Good, now that we understand the four arguments, let's see what the d function actually does inside!
Step 4: what does the d function actually do inside?
Pull the JS code out:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
It defines a dictionary h, and the variable i is set to a(16).
A closer look shows that a, b, c and d are all functions.
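The data flow inside d is easier to see when written out in Python. This is only a sketch of the control flow: make_form is a made-up name, and b and c are passed in as plain callables because their real bodies have not been examined at this point in the walkthrough.

```python
def make_form(d, e, f, g, i, b, c):
    """Mirror the data flow of the JS function d(d, e, f, g).

    i is the random string that a(16) would return; b and c are
    stand-ins for the page's two helper functions.
    """
    enc_text = b(d, g)         # h.encText = b(d, g)
    enc_text = b(enc_text, i)  # h.encText = b(h.encText, i)
    enc_sec_key = c(i, e, f)   # h.encSecKey = c(i, e, f)
    # The page later sends encText under the form field name "params"
    return {"params": enc_text, "encSecKey": enc_sec_key}
```

The point to notice is that b runs twice, first with the fixed string g and then with the random string i, while c only ever sees i together with e and f.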
First, look inside function a:
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
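At first glance, what on earth is this? No rush: run this JS from PyCharm to see what the returned c looks like, then work backwards to what the function does and how i is obtained.
Copying the page's JS straight into PyCharm fails on certain line-break characters, so Xiao Yedou first pasted the JS into the tool shown below (apparently something front-end developers use).
The code for executing a JS file from PyCharm is:
pip install PyExecJS
import execjs

# Read the extracted JavaScript and compile it with PyExecJS
with open('./analysis_3.js', 'r', encoding='utf8') as fh:
    js = fh.read()
aim = execjs.compile(js)
# Call the JS function named "example" and print what it returns
data = aim.call('example')
print(data)
The result is shown below: a different 16-character string on every run! So a apparently picks 16 random characters from that long alphabet string and concatenates them; Xiao Yedou guesses that is all function a does!
Result 2 is shown in the figure below: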
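If that guess about a is right, the same behaviour takes only a few lines of Python. The names ALPHABET and random_key below are made up for this sketch.

```python
import random
import string

# Same 62-character alphabet as the string b inside the JS function a
ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits


def random_key(length=16):
    """Return `length` characters drawn uniformly at random from ALPHABET."""
    return "".join(random.choice(ALPHABET) for _ in range(length))
```

random_key() plays the role of a(16): every call gives a fresh 16-character string.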
Below is the JS pulled out the first time, just to obtain the value of i (analysis_3.js):
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
function example() {
    var i = a(16);
    return i;
}
Back inside d, the next statement is h.encText = b(d, g), so we need to call function b with the two arguments d and g. We already know how to build both of them, so no big deal! Keep pulling JS!
First pull function b out and take a look:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
function b(a, b) {
    var c = CryptoJS.enc.Utf8.parse(b),
        d = CryptoJS.enc.Utf8.parse("0102030405060708"),
        e = CryptoJS.enc.Utf8.parse(a),
        f = CryptoJS.AES.encrypt(e, c, {
            iv: d,
            mode: CryptoJS.mode.CBC
        });
    return f.toString()
}
function example() {
    var i = a(16);
    var h = {};
    // b expects a string, so serialise the payload with JSON.stringify just as the page does
    h.encText = b(JSON.stringify({"rid": "R_SO_4_1817702136", "threadId": "R_SO_4_1817702136", "pageNo": "2", "pageSize": "20", "cursor": "1613900247044", "offset": "40", "orderType": "1", "csrf_token": "d4339865ec133c9a7d77a25389bc0265"}), "0CoJUm6Qyw8W8jud");
    return h;
}
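Function b is AES in CBC mode with the fixed IV "0102030405060708". No padding option is passed, so CryptoJS uses its default Pkcs7 padding, and toString() on the result gives Base64 text. The AES step itself needs a third-party library in Python, so the sketch below stays with the standard library and only shows the padding rule and the resulting Base64 length; the function names are made up for illustration.

```python
import base64
import math

BLOCK = 16  # AES block size in bytes


def pkcs7_pad(data: bytes, block: int = BLOCK) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of `block`."""
    pad = block - len(data) % block  # always 1..block, never 0
    return data + bytes([pad]) * pad


def base64_len_after_aes(plain_len: int) -> int:
    """Length of the Base64 text for a CBC ciphertext of `plain_len` plaintext bytes."""
    padded = (plain_len // BLOCK + 1) * BLOCK  # ciphertext length equals padded length
    return 4 * math.ceil(padded / 3)


# Base64 of a padded (not encrypted) block, only to show the two helpers agree
demo = base64.b64encode(pkcs7_pad(b"hello"))
```

Because b runs twice inside d, the second pass encrypts the Base64 text of the first pass, which is part of why params ends up noticeably longer than the JSON it started from.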
Running this JS file from PyCharm raises an error:
CryptoJS is not defined
The JS is simply missing the CryptoJS object. No big deal, we pull it out of the page's JS source as well: search for the name, find what looks like its definition, and copy it over. Not hard!
The JS after the third round of extraction looks like this:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
function a(a) {
    var d, e, b = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", c = "";
    for (d = 0; a > d; d += 1)
        e = Math.random() * b.length,
        e = Math.floor(e),
        c += b.charAt(e);
    return c
}
function b(a, b) {
    var c = CryptoJS.enc.Utf8.parse(b),
        d = CryptoJS.enc.Utf8.parse("0102030405060708"),
        e = CryptoJS.enc.Utf8.parse(a),
        f = CryptoJS.AES.encrypt(e, c, {
            iv: d,
            mode: CryptoJS.mode.CBC
        });
    return f.toString()
}
var CryptoJS = CryptoJS || function ( u, p) {
var d = { }
, l = d. lib = { }
, s = function ( ) { }
, t = l. Base = {
extend: function ( a) {
s. prototype = this ;
var c = new s ;
a && c. mixIn ( a) ;
c. hasOwnProperty ( "init" ) || ( c. init = function ( ) {
c. $super . init. apply ( this , arguments)
}
) ;
c. init. prototype = c;
c. $super = this ;
return c
} ,
create: function ( ) {
var a = this . extend ( ) ;
a. init. apply ( a, arguments) ;
return a
} ,
init: function ( ) { } ,
mixIn: function ( a) {
for ( var c in a)
a. hasOwnProperty ( c) && ( this [ c] = a[ c] ) ;
a. hasOwnProperty ( "toString" ) && ( this . toString = a. toString)
} ,
clone: function ( ) {
return this . init. prototype. extend ( this )
}
}
, r = l. WordArray = t. extend ( {
init: function ( a, c) {
a = this . words = a || [ ] ;
this . sigBytes = c != p ? c : 4 * a. length
} ,
toString: function ( a) {
return ( a || v) . stringify ( this )
} ,
concat: function ( a) {
var c = this . words
, e = a. words
, j = this . sigBytes;
a = a. sigBytes;
this . clamp ( ) ;
if ( j % 4 )
for ( var k = 0 ; k < a; k++ )
c[ j + k >>> 2 ] |= ( e[ k >>> 2 ] >>> 24 - 8 * ( k % 4 ) & 255 ) << 24 - 8 * ( ( j + k) % 4 ) ;
else if ( 65535 < e. length)
for ( k = 0 ; k < a; k += 4 )
c[ j + k >>> 2 ] = e[ k >>> 2 ] ;
else
c. push. apply ( c, e) ;
this . sigBytes += a;
return this
} ,
clamp: function ( ) {
var a = this . words
, c = this . sigBytes;
a[ c >>> 2 ] &= 4294967295 << 32 - 8 * ( c % 4 ) ;
a. length = u. ceil ( c / 4 )
} ,
clone: function ( ) {
var
a = t. clone. call ( this ) ;
a. words = this . words. slice ( 0 ) ;
return a
} ,
random: function ( a) {
for ( var c = [ ] , e = 0 ; e < a; e += 4 )
c. push ( 4294967296 * u. random ( ) | 0 ) ;
return new r. init ( c, a)
}
} )
, w = d. enc = { }
, v = w. Hex = {
stringify: function ( a) {
var c = a. words;
a = a. sigBytes;
for ( var e = [ ] , j = 0 ; j < a; j++ ) {
var k = c[ j >>> 2 ] >>> 24 - 8 * ( j % 4 ) & 255 ;
e. push ( ( k >>> 4 ) . toString ( 16 ) ) ;
e. push ( ( k & 15 ) . toString ( 16 ) )
}
return e. join ( "" )
} ,
parse: function ( a) {
for ( var c = a. length, e = [ ] , j = 0 ; j < c; j += 2 )
e[ j >>> 3 ] |= parseInt ( a. substr ( j, 2 ) , 16 ) << 24 - 4 * ( j % 8 ) ;
return new r. init ( e, c / 2 )
}
}
, b = w. Latin1 = {
stringify: function ( a) {
var c = a. words;
a = a. sigBytes;
for ( var
e = [ ] , j = 0 ; j < a; j++ )
e. push ( String. fromCharCode ( c[ j >>> 2 ] >>> 24 - 8 * ( j % 4 ) & 255 ) ) ;
return e. join ( "" )
} ,
parse: function ( a) {
for ( var c = a. length, e = [ ] , j = 0 ; j < c; j++ )
e[ j >>> 2 ] |= ( a. charCodeAt ( j) & 255 ) << 24 - 8 * ( j % 4 ) ;
return new r. init ( e, c)
}
}
, x = w. Utf8 = {
stringify: function ( a) {
try {
return decodeURIComponent ( escape ( b. stringify ( a) ) )
} catch ( c ) {
throw Error ( "Malformed UTF-8 data" )
}
} ,
parse: function ( a) {
return b. parse ( unescape ( encodeURIComponent ( a) ) )
}
}
, q = l. BufferedBlockAlgorithm = t. extend ( {
reset: function ( ) {
this . i5n = new r. init ;
this . sN2x = 0
} ,
vE3x: function ( a) {
"string" == typeof a && ( a = x. parse ( a) ) ;
this . i5n. concat ( a) ;
this . sN2x += a. sigBytes
} ,
lg9X: function ( a) {
var c = this . i5n
, e = c. words
, j = c. sigBytes
, k = this . blockSize
, b = j / (
4 * k)
, b = a ? u. ceil ( b) : u. max ( ( b | 0 ) - this . GX6R , 0 ) ;
a = b * k;
j = u. min ( 4 * a, j) ;
if ( a) {
for ( var q = 0 ; q < a; q += k)
this . qB1x ( e, q) ;
q = e. splice ( 0 , a) ;
c. sigBytes -= j
}
return new r. init ( q, j)
} ,
clone: function ( ) {
var a = t. clone. call ( this ) ;
a. i5n = this . i5n. clone ( ) ;
return a
} ,
GX6R : 0
} ) ;
l. Hasher = q. extend ( {
cfg: t. extend ( ) ,
init: function ( a) {
this . cfg = this . cfg. extend ( a) ;
this . reset ( )
} ,
reset: function ( ) {
q. reset. call ( this ) ;
this . lJ9A ( )
} ,
update: function ( a) {
this . vE3x ( a) ;
this . lg9X ( ) ;
return this
} ,
finalize: function ( a) {
a && this . vE3x ( a) ;
return this . mG0x ( )
} ,
blockSize: 16 ,
lV0x: function ( a) {
return function ( b, e) {
return ( new a. init ( e) ) . finalize ( b)
}
} ,
vC3x: function ( a) {
return function ( b, e) {
return ( new n. HMAC. init ( a, e) ) . finalize ( b)
}
}
} ) ;
var n = d. algo = { } ;
return d
} ( Math) ;
function example() {
    var i = a(16);
    var h = {};
    // b expects a string, so serialise the payload with JSON.stringify just as the page does
    h.encText = b(JSON.stringify({"rid": "R_SO_4_1817702136", "threadId": "R_SO_4_1817702136", "pageNo": "2", "pageSize": "20", "cursor": "1613900247044", "offset": "40", "orderType": "1", "csrf_token": "d4339865ec133c9a7d77a25389bc0265"}), "0CoJUm6Qyw8W8jud");
    return h;
}
Sigh, another error. The code is still incomplete, and once again something is missing:
Cannot read property 'encrypt' of undefined
Nothing for it but to keep hunting and copy-pasting code; reverse engineering takes patience...
Same as before, search for encrypt. This one is odd, though: no single function defines it, and several pieces seem to have to run together, so Xiao Yedou simply pulled them all down!
The post is getting too long, so the full JS is not listed here; see the end of the post for how to get it!
Here is the main JS function at least, otherwise it is hard to explain:
function example() {
    var i = a(16);
    var h = {};
    // b expects a string, so serialise the payload with JSON.stringify just as the page does
    h.encText = b(JSON.stringify({"rid": "R_SO_4_1817702136", "threadId": "R_SO_4_1817702136", "pageNo": "2", "pageSize": "20", "cursor": "1613900247044", "offset": "40", "orderType": "1", "csrf_token": "d4339865ec133c9a7d77a25389bc0265"}), "0CoJUm6Qyw8W8jud");
    return h;
}
At this point, inside d:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
The main JS function:
function example() {
    var i = a(16);
    var h = {};
    h.encText = b(JSON.stringify({"rid": "R_SO_4_1817702136", "threadId": "R_SO_4_1817702136", "pageNo": "2", "pageSize": "20", "cursor": "1613900247044", "offset": "40", "orderType": "1", "csrf_token": "d4339865ec133c9a7d77a25389bc0265"}), "0CoJUm6Qyw8W8jud");
    // Second pass: encrypt the first result again with the random string i
    h.encText = b(h.encText, i);
    return h;
}
At this point, inside d:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
The main JS function:
function example() {
    var i = a(16);
    var h = {};
    h.encText = b(JSON.stringify({"rid": "R_SO_4_1817702136", "threadId": "R_SO_4_1817702136", "pageNo": "2", "pageSize": "20", "cursor": "1613900247044", "offset": "40", "orderType": "1", "csrf_token": "d4339865ec133c9a7d77a25389bc0265"}), "0CoJUm6Qyw8W8jud");
    h.encText = b(h.encText, i);
h. encSecKey = c ( i, "010001" , "00e0b509f6259df8642dbc35662901477df22677ec152b5ff68ace615bb7b725152b3ab17a876aea8a5aa76d2e417629ec4ee341f56135fccf695280104e0312ecbda92557c93870114af6c9d05c4f7f0c3685b7a46bee255932575cce10b424d813cfe4875d3e82047b97ddef52741d546b8e289dc6935b3ece0462db0a22b8e7" )
return h;
}
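The body of c is not shown in this post, so the following is only a guess at what it does, not the page's confirmed logic. Python re-implementations of this site commonly assume that c is unpadded RSA applied to the reversed random string, with e as the public exponent and f as the modulus. Under that assumption, and with a made-up function name, it can be sketched with the standard library alone:

```python
def rsa_encrypt_hex(text, exponent_hex, modulus_hex, width=256):
    """Unpadded RSA over the reversed text, returned as zero-padded hex.

    ASSUMPTION: this mirrors what c(i, e, f) is commonly reported to do;
    the post itself never shows the body of c.
    """
    # Assumed detail: the string is reversed before being read as a number
    message = int.from_bytes(text[::-1].encode("utf-8"), "big")
    cipher = pow(message, int(exponent_hex, 16), int(modulus_hex, 16))
    # 256 hex digits matches the length of the encSecKey seen for the 1024-bit modulus f
    return format(cipher, "x").zfill(width)
```

With the random 16-character string as text, "010001" as exponent_hex and the long f value as modulus_hex, the result is a 256-character hex string, the same shape as encSecKey. Whether it matches the page's output byte for byte has to be checked against a captured value.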
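Running it raised an error because of an undefined function, followed by a whole series of similar errors. It took Xiao Yedou over an hour to fill in every missing function, so here is a screenshot for you all!
Sigh, the takeaway is simply this: copy everything from that "!" up there down to the function the first error asked for!
In the end it all boils down to:
function d(d, e, f, g) {
    var h = {},
        i = a(16);
    return h.encText = b(d, g),
        h.encText = b(h.encText, i),
        h.encSecKey = c(i, e, f),
        h
}
asrsea = d;
// Schematic call: d, e, f and g here stand for the four argument values captured at the breakpoint
var bWv4z = asrsea(JSON.stringify(d), e, f, g);
Core idea: in the end we call the function d with four arguments, and the dictionary it returns is assigned to bWv4z. Function b in turn depends on other functions; just add whatever is missing!
var data = {
    params: bWv4z.encText,
    encSecKey: bWv4z.encSecKey
}
The encText and encSecKey values in this dictionary are the two encrypted parameters params and encSecKey.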
{ 'params' : 'GeZ2hGQu0LGQlB4VQebjp6n74Oq4/32rvafzEjRm9YSwMU7MBR9hC8f4riioTrVZien4zLXoPv+AVMUy5YV0Z/57uz6MbnX6pcyS99OSzJcvbBzgM5oTFpS2faYdUCieyRYIWmna8c9SwS/yE+/EsaA3GMRpXoMhnV1ibdUY0/NUuDT5QpXjlNirryMJN0N66FvDT3yPS1aVEuCiEE9h3833g107ljF8vEkguSOBxi7eRMgT2W1nz9HQNJU5pniYsc8ntMeQESk4NblkNnEx6307E3uxMeAST2uJPchaTc4tb+TcDlZN/PLpz2OV62hJic9dNEfaxic7Jybvtn+I6lyyrD11x4xe4b7s915g5eo=' , 'encSecKey' : '8e8659fcff20f47c9823685b6b86cf976f7d7bfa9db447e3a8437839c0ed7837d529c9c7c245c9807f3277c85c6141f2621ad916c81d5db964eb56282d016142e4058db17aafb7bca8869b3fa537ba7422b347731526cc86e3c117277e3a569348ed51da09e5331ce3fad1c381c17fc0bef001a43cae46a22a48329c554c0f56' }
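Take these two parameters, build the form, and send the POST request,
and we get the comment data shown in the figure!
The Python code:
import execjs
import requests
import json
import sys, io

# Re-wrap stdout as gb18030 so Chinese text prints correctly on a Windows console
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
# Compile the extracted JS and call start() to get the params / encSecKey dictionary
with open('./网易云.js', 'r', encoding='utf8') as fh:
    js = fh.read()
aim = execjs.compile(js)
data = aim.call('start')
print(data)
url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token=d4339865ec133c9a7d77a25389bc0265"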
headers = {
'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74' ,
'cookie' : 'WM_TID=6S%2BVxiZNM4xBURVBQUYqChbHQqRPtWJo; _iuqxldmzr_=32; ntes_kaola_ad=1; _ntes_nnid=4a37e82e4fab3d88933e0d5c379d440e,1593854146443; _ntes_nuid=4a37e82e4fab3d88933e0d5c379d440e; vinfo_n_f_l_n3=65b5de1ae241f0fb.1.2.1596075019898.1596077656335.1596082873052; UM_distinctid=1776c974d24410-0bb2c863a112f2-50391c40-1fa400-1776c974d25bd1; NMTID=00OlBs1TG5OHV33AkargjKWlPeNAmEAAAF3i9nSKg; __csrf=d4339865ec133c9a7d77a25389bc0265; MUSIC_U=bed21edcbf5c4808fefc260d37faef01b6d633afc3be6807d7f44dcfc98c3d5433a649814e309366; WM_NI=FH%2Bdu5qU8nodD0Tz91k%2F%2BSZTQaN1uVEUrJ6lgO3CNyrzxn2qHc4gqOO5ytsdhxlqnw%2Fr02fjvENiMGXplNN0Y%2BehIKexHjWU%2FJN05AfaCWE%2F9GO4sfosi0qw38qX3M8KaWs%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee92fc6292f5ff89ca448cb08fa3d54a839e8e84f56ff58b8491b673b2b98ab3b42af0fea7c3b92a8698e585f83fb38bf88fee7088f09c9bbc3ba9f1e5b4d96093bf82d2f85e878bbeabb568a591fda4cd73b2a6b88fb17b8cbdfc88f4458debb795f270aca8a492ed7dedae00d3e562b69b828ad04d9cb0a4afcd3c9bf5ab89cc41a7a8888bd65fed9682d1b84ba6bf8db8c825fb9981d5f26eadada7d1f47df1e8bfb9b247fc95afb8f237e2a3; JSESSIONID-WYYY=1bOvTxGdlne5T%2B%5CvUNf761crAP4mCs4kXFSp25NsXXnrNWkrO5Tk8p5ykpnsb9X%2B%2F1ofrjxduvuZfC2kPNgz40bpqqlCgYlf1f2cXl%5C1yRO8aZ0IlubEo7n7xs0AX%2FffBDdG5t12CuiOPHI%5CZPleGyhEmbbOJt%2Bkt6XxSohZCQiQmPWr%3A1613963805180; WEVNSM=1.0.0; WNMCID=iuuttf.1613962005402.01.0' ,
'referer' : 'https://music.163.com/song?id=1817702136' ,
}
# Send the two encrypted parameters as form data
r = requests.post(url=url, headers=headers, data=data)
if r.status_code == 200:
    print("NetEase Cloud Music API request succeeded")
    text = r.text.encode('utf8', "ignore").decode('utf8', "ignore")
    content = json.loads(text)
    user_list = content['data']['comments']
    for user in user_list:
        # beReplied holds the comment(s) this comment is replying to, if any
        if user['beReplied'] is not None:
            for item in user['beReplied']:
                print(item['content'])
else:
    print("error")
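The response handling can also be pulled into a small function, which makes it easy to try out on a hand-written dictionary without touching the network. extract_comments is a made-up helper; the keys it reads (data, comments, beReplied, content) are the ones used in the script above.

```python
def extract_comments(content):
    """Collect comment texts, including quoted replies, from a parsed response dict."""
    texts = []
    # "or []" covers both a missing key and an explicit null in the JSON
    for comment in (content.get("data") or {}).get("comments") or []:
        for quoted in comment.get("beReplied") or []:
            if quoted.get("content"):
                texts.append(quoted["content"])
        if comment.get("content"):
            texts.append(comment["content"])
    return texts
```

In the script above this would be called as extract_comments(content) right after json.loads(text).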
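P.S. Reverse engineering basically means either lifting the source code or re-implementing the needed functionality in Python (or another language) to obtain the encrypted parameters. Whatever function is missing, pull it out and drop it into the JS file, one step at a time. In short, it is tedious!
The code above only scrapes a single page. To fetch several pages, rewrite the function so that it takes parameters; the fields inside d were described above!
Xiao Yedou gave it a try and hopes some expert who reads this can explain why it stops working at page 10! Comments are welcome.
Xiao Yedou's Python code:
import execjs
import requests
import json
import sys, io
import time

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')
page = int(input("Number of pages to fetch: "))
# Compile the JS once, outside the loop
with open('./网易云.js', 'r', encoding='utf8') as fh:
    aim = execjs.compile(fh.read())
for i in range(1, page + 1):
    print(f'i:{i}')
    offset = str(i * 20)
    cursor = str(int(time.time() * 1000))  # 13-digit millisecond timestamp
    # Pass the current page number i (the original passed the total page count here)
    data = aim.call('start', offset, str(i), cursor)
    url = "https://music.163.com/weapi/comment/resource/comments/get?csrf_token=d4339865ec133c9a7d77a25389bc0265"
    headers = {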
'user-agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74' ,
'cookie' : 'WM_TID=6S%2BVxiZNM4xBURVBQUYqChbHQqRPtWJo; _iuqxldmzr_=32; ntes_kaola_ad=1; _ntes_nnid=4a37e82e4fab3d88933e0d5c379d440e,1593854146443; _ntes_nuid=4a37e82e4fab3d88933e0d5c379d440e; vinfo_n_f_l_n3=65b5de1ae241f0fb.1.2.1596075019898.1596077656335.1596082873052; UM_distinctid=1776c974d24410-0bb2c863a112f2-50391c40-1fa400-1776c974d25bd1; NMTID=00OlBs1TG5OHV33AkargjKWlPeNAmEAAAF3i9nSKg; __csrf=d4339865ec133c9a7d77a25389bc0265; MUSIC_U=bed21edcbf5c4808fefc260d37faef01b6d633afc3be6807d7f44dcfc98c3d5433a649814e309366; WM_NI=d1%2F2ZoLk6b6YJnwdLJEo2E4vkr7u2MMvjXLw34zWu7E15cNm%2BUQL7G14j36Nw4Y1JMsQZzutuQr%2FBeI1EiTJIp7tOnFsp%2F63a6a8MFDWIwVCaeh7P9%2FjfqQJTf28V7XVY3I%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6eeb5b769b097faa2f64eae868eb2c54b979e9e85b645f58c8ab5c67a8fb197b5aa2af0fea7c3b92aa8efe1d3d96590b5fa83b5538ab2b985cf4287b1fea6b77ea9aef996b1478faaa6b9cf7df1ecb6b8f83d87effb84cd67e98b9fadb64df3eca6d4b546b5ec97d2c947b18f969bf880b78fbed0c16f8bef8b8ad7428bb0f995ce6686adc0d0b539a3eaafd9d04e88898798d549a3bba8d8d333edb6aa92d44298bafed9d53d9c8a97d4cc37e2a3; __root_domain_v=.163.com; _qddaz=QD.uerirb.s0cddl.klgiphn9; hb_MA-9F44-2FC2BD04228F_source=www.baidu.com; JSESSIONID-WYYY=Dnl9Q%5CsA%5CgFxr35Z6Yop8S4cgUw2XbSi63P5%5CylO1H9i%5CiJfKGDN7wYoc6nyGftQ6UtwmY5A6PTypX2mR147jRrZkF2zDqwuyQA4%2F%2F3CQnOhFKXz57z8WCCHeX9%5Cz%2BtVzo%2Byu%2B5un%5CR4af37PI%2Bxj7FNIbe9dCG9pRuMZ%2Bv4mGhyJG90%3A1613996422319; WEVNSM=1.0.0; WNMCID=sqowiw.1613994622490.01.0' ,
'referer' : 'https://music.163.com/song?id=1817702136' ,
}
    r = requests.post(url=url, headers=headers, data=data)
    if r.status_code == 200:
        text = r.text.encode('utf8', "ignore").decode('utf8', "ignore")
        content = json.loads(text)
        user_list = content['data']['comments']
        # Append every comment (and any comment it replies to) to the text file
        with open('星辰大海.txt', 'a', encoding="utf8") as f:
            for user in user_list:
                if user['beReplied'] is not None:
                    for item in user['beReplied']:
                        print(item['content'])
                        f.write(item['content'])
                        f.write('\n')
                f.write(user['content'])
                f.write('\n')
                print(user['content'])
        time.sleep(0.5)  # short pause between pages
    else:
        print("error")
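Part 2: Sentiment analysis with snownlp
If you have not installed this library yet, install it first!
pip install snownlp
How to put it: this library feels purpose-built for sentiment scoring of Chinese text. Since this post focuses on JS reverse engineering, the sentiment analysis part is kept simple!
First, the basic usage: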
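from snownlp import SnowNLP

sentence = u'欢迎大家订阅小夜斗的博客'
word_list = SnowNLP(sentence)
print(word_list)
print(word_list.words)                # the sentence split into words
print(' '.join(word_list.words))
emotion_score = word_list.sentiments  # score between 0 and 1; closer to 1 means more positive
print(emotion_score)
Looks like this sentence does not score very high. Maybe it is because nothing was tuned; no big deal, today Xiao Yedou is only introducing the basic usage!
Now let's try another sentence:
sentence_2 = "我喜欢你,我想和你在一起!"
word_list_2 = SnowNLP(sentence_2)
emotion_score_2 = word_list_2.sentiments
print(emotion_score_2)
Hmm, looks OK: a score of 0.7, where the maximum is 1!
sentence_3 = "我讨厌你,我不想和你在一起!"
word_list_3 = SnowNLP(sentence_3)
emotion_score_3 = word_list_3.sentiments
print(emotion_score_3)
Xiao Yedou is now crying in the bathroom: this hateful sentence scored higher than his welcoming one! Hmm, it must be because the model was never trained on sentences like these, yes!
No big deal, no big deal. Next, let's generate a simple word cloud!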
import jieba
import matplotlib.pyplot as plt
from wordcloud import WordCloud

with open('星辰大海.txt', encoding='utf8') as fh:
    text = fh.read()
print(type(text))
# Segment the Chinese text in precise mode and join the words with spaces
wordlist = jieba.cut(text, cut_all=False)
wl_space_split = " ".join(wordlist)
print(wl_space_split)
wc = WordCloud(
    background_color="white",
    max_words=200,
    font_path=r'C:/Windows/Fonts/STXINGKA.TTF',  # a font that contains Chinese glyphs
    width=400,
    height=200,
    scale=10).generate(wl_space_split)
plt.imshow(wc)
plt.axis("off")
plt.show()
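The song 星辰大海 is about two people running towards each other, so call it a love song. Across the 9 pages of comments, the most frequent words are 喜欢 (like), 惊喜 (surprise) and 花 (flower). That's love for you!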
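A word cloud only shows frequency by font size; to read off the actual counts, the segmented words can be fed to collections.Counter. The sketch below takes an already segmented list (for example list(jieba.cut(text))); top_words and its stopwords parameter are made-up names.

```python
from collections import Counter


def top_words(tokens, n=3, stopwords=frozenset()):
    """Return the n most frequent tokens, skipping whitespace and stopwords."""
    counts = Counter(
        tok for tok in tokens
        if tok.strip() and tok not in stopwords
    )
    return counts.most_common(n)
```

Passing a small stopword set such as {"的", "了"} keeps function words from crowding out the words that actually carry the mood.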
Open the comment file 星辰大海.txt, run sentiment analysis on every comment in it, then plot a histogram to see which way the song's overall sentiment leans!
from snownlp import SnowNLP
import matplotlib.pyplot as plt
import numpy as np

sentimentslist = []
with open("星辰大海.txt", "r", encoding="utf8") as source:
    for line in source:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        s = SnowNLP(line)
        print(s.sentiments)
        sentimentslist.append(s.sentiments)

plt.rcParams['font.sans-serif'] = ['SimHei']  # font for the Chinese song title
plt.rcParams['axes.unicode_minus'] = False
# 100 equal bins covering 0 to 1 inclusive (np.arange(0, 1, 0.01) would stop at 0.99)
plt.hist(sentimentslist, bins=np.linspace(0, 1, 101), facecolor='g')
plt.xlabel('Sentiment score', size=12)
plt.ylabel('Number of comments with that score', size=12)
plt.title('Overall sentiment of 星辰大海', color="red", size=12)
plt.show()
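The song stresses that love runs in both directions. Judging subjectively, the song as a whole leans positive, although there is no shortage of people who feel sad because of love. In the histogram above, the scores mostly fall in [0.5, 1], with [0.8, 1] taking a large share, and only one or two negative comments in [0, 0.2], perhaps from people whose hearts were broken!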
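Reading shares off a histogram by eye is rough, so here is a small sketch for putting numbers on the three intervals mentioned above. share_in_range is a made-up helper that works on any list of scores, such as sentimentslist.

```python
def share_in_range(scores, low, high):
    """Fraction of scores that fall in the closed interval [low, high]."""
    if not scores:
        return 0.0
    hits = sum(1 for s in scores if low <= s <= high)
    return hits / len(scores)
```

For example, share_in_range(sentimentslist, 0.8, 1.0) gives the proportion of strongly positive comments, and share_in_range(sentimentslist, 0.0, 0.2) the proportion of strongly negative ones.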
For more on data mining, here is a real gem of a blogger:
https://blog.csdn.net/eastmount/article/details/52577215
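That's it for this post. If you found it interesting, why not like and bookmark it!
Source code and data: follow the WeChat official account "夜斗小神社" and reply "007网易数据" in the chat.
On this planet, you matter a lot. Please cherish how precious you are! ~~~夜斗小神社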