
constantstranger

Topics recently created by constantstranger
constantstranger recently replied
3 years ago
Replied to a topic created by constantstranger » How to merge multiple dictionary objects into one list in Python?

(Note: you might want to look at using pandas for this. See the second half of this answer for an example.)

To answer your question directly: if you want actual typed values (int for acres, float for latitude/longitude), here is a way to do it using regular expressions:

import re

def readParkFile(fileName="national_parks.csv"):
    parkList = []
    with open(fileName, "r") as f:
        # the first line holds the column names
        keys = f.readline().strip("\n").split(",")
        for line in f:
            # non-greedy groups for every field except the last (Description)
            v = re.search('(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*)', line).groups()
            Code, Name, State, Acres, Latitude, Longitude, Date, Description = v[0], v[1], v[2], int(v[3]), float(v[4]), float(v[5]), v[6], v[7][1:-1]
            parkDict = dict(zip(keys, [Code, Name, State, Acres, Latitude, Longitude, Date, Description]))
            parkList.append(parkDict)
    return parkList

parkList = readParkFile()

Inside the loop, re.search() uses non-greedy qualifiers on all of the comma-separated groups except the last one (Description), then converts the numeric fields to numbers and strips the surrounding quotes from Description. These typed values are then combined with keys using zip() into a dict, which is appended to the result, parkList. After looping over all of the lines in the csv file, the function returns parkList.
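To see what the non-greedy groups produce, here is a minimal sketch on a single made-up line in the same format (not a record from the actual file):

```python
import re

# Hypothetical one-line example in the national_parks.csv format
line = 'ACAD,Acadia,ME,47390,44.35,-68.21,1919-02-26,"Covers Mount Desert Island."\n'
v = re.search('(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*?),(.*)', line).groups()
# The last greedy group stops before '\n' (no re.S), so v[7] keeps its quotes,
# which [1:-1] then strips.
print(int(v[3]), float(v[4]), v[7][1:-1])  # 47390 44.35 Covers Mount Desert Island.
```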

The first dictionary element of the resulting list of dicts looks like this:

{'Code': 'ACAD', 'Name': 'Acadia National Park', 'State': 'ME', 'Acres': 47390, 'Latitude': 44.35, 'Longitude': -68.21, 'Date': '1919-02-26', 'Description': 'Covering most of Mount Desert Island and other coastal islands, Acadia features the tallest mountain on the Atlantic coast of the United States, granite peaks, ocean shoreline, woodlands, and lakes. There are freshwater, estuary, forest, and intertidal habitats.'}

An alternative way, using pandas, is this:

import pandas as pd
df = pd.read_csv("national_parks.csv")
print(df)
print(df.dtypes)
parkList = df.to_dict('records')
print(parkList[0])

It will give the following output:

    Code                                        Name State   Acres  Latitude  Longitude        Date                                        Description
0   ACAD                        Acadia National Park    ME   47390     44.35     -68.21  1919-02-26  Covering most of Mount Desert Island and other...
1   ARCH                        Arches National Park    UT   76519     38.68    -109.57  1971-11-12  This site features more than 2,000 natural san...
2   BADL                      Badlands National Park    SD  242756     43.75    -102.50  1978-11-10  The Badlands are a collection of buttes, pinna...
3   BIBE                      Big Bend National Park    TX  801163     29.25    -103.25  1944-06-12  Named for the prominent bend in the Rio Grande...
4   BISC                      Biscayne National Park    FL  172924     25.65     -80.08  1980-06-28  The central part of Biscayne Bay, this mostly ...
5   BLCA  Black Canyon of the Gunnison National Park    CO   32950     38.57    -107.72  1999-10-21  The park protects a quarter of the Gunnison Ri...
6   BRCA                  Bryce Canyon National Park    UT   35835     37.57    -112.18  1928-02-25  Bryce Canyon is a geological amphitheater on t...
7   CANY                   Canyonlands National Park    UT  337598     38.20    -109.93  1964-09-12  This landscape was eroded into a maze of canyo...
8   CARE                  Capitol Reef National Park    UT  241904     38.20    -111.17  1971-12-18  The park's Waterpocket Fold is a 100-mile (160...
9   CAVE              Carlsbad Caverns National Park    NM   46766     32.17    -104.44  1930-05-14  Carlsbad Caverns has 117 caves, the longest of...
10  CHIS               Channel Islands National Park    CA  249561     34.01    -119.42  1980-03-05  Five of the eight Channel Islands are protecte...
11  CONG                      Congaree National Park    SC   26546     33.78     -80.78  2003-11-10  On the Congaree River, this park is the larges...
12  CRLA                   Crater Lake National Park    OR  183224     42.94    -122.10  1902-05-22  Crater Lake lies in the caldera of an ancient ...
13  CUVA               Cuyahoga Valley National Park    OH   32950     41.24     -81.55  2000-10-11  This park along the Cuyahoga River has waterfa...
Code            object
Name            object
State           object
Acres            int64
Latitude       float64
Longitude      float64
Date            object
Description     object
dtype: object
{'Code': 'ACAD', 'Name': 'Acadia National Park', 'State': 'ME', 'Acres': 47390, 'Latitude': 44.35, 'Longitude': -68.21, 'Date': '1919-02-26', 'Description': 'Covering most of Mount Desert Island and other coastal islands, Acadia features the tallest mountain on the Atlantic coast of the United States, granite peaks, ocean shoreline, woodlands, and lakes. There are freshwater, estuary, forest, and intertidal habitats.'}

As you can see, with a single call to read_csv(), pandas parses the csv file, figures out the data type of each column, and assembles all of this into a DataFrame object. You can then get a list of dicts by calling to_dict() on the DataFrame object with the argument 'records'.
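As a self-contained sketch of the dtype inference and the to_dict('records') conversion, with io.StringIO standing in for the real national_parks.csv file:

```python
import io
import pandas as pd

# io.StringIO lets read_csv consume an in-memory CSV instead of a file on disk
csv_text = "Code,Acres,Latitude\nACAD,47390,44.35\nARCH,76519,38.68\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.dtypes)      # Acres inferred as int64, Latitude as float64
records = df.to_dict('records')
print(records[0])     # the first row as a dict keyed by column name
```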

3 years ago
Replied to a topic created by constantstranger » Python pandas - fill in df NaN values

The docs here show how to do this:

df.fillna(method="ffill")
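A minimal sketch with a made-up DataFrame; note that df.ffill() is equivalent and is the preferred spelling in recent pandas versions, where the method= argument of fillna is deprecated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0]})
# Forward-fill: each NaN takes the last preceding non-NaN value
print(df.ffill()["a"].tolist())  # [1.0, 1.0, 1.0, 4.0]
```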
3 years ago
Replied to a topic created by constantstranger » How to extract paragraphs from text in Python using regular expressions

Here you go:

import re
text = "Okay. So.\n What do we do now?\n I have no clue.\n"
print(re.findall(".*\n",text))

... which gives:

['Okay. So.\n', ' What do we do now?\n', ' I have no clue.\n']

Update: the behavior described in the question can be achieved as follows:

import re
text = "but Okay. So.\n What do we\n do now?\n I have no clue.\n"
print(re.findall('[A-Z].*?[?!.]\n', text, re.S))

Output:

['Okay. So.\n', 'What do we\n do now?\n', 'I have no clue.\n']

Explanation:

  • [A-Z] matches the leading capital letter
  • .*? matches any characters non-greedily, thanks to the trailing ?, as described in the docs:

*?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.

  • [?!.]\n matches an occurrence of '?', '!' or '.' followed by '\n'; thanks to the non-greedy qualifier ? in the previous bullet, it won't skip past anything matching this, and thus divides the string into the "paragraphs" described in the question.
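As a quick standalone illustration of greedy vs. non-greedy matching (the '<a> b <c>' example from the Python re docs):

```python
import re

s = "<a> b <c>"
print(re.findall("<.*>", s))   # ['<a> b <c>']  -- greedy: as much text as possible
print(re.findall("<.*?>", s))  # ['<a>', '<c>'] -- non-greedy: as little as possible
```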
3 years ago
Replied to a topic created by constantstranger » Finding unique substring patterns in a list of strings using Python

You can use the same regular expression (I added a hyphen in the session portion), changing the set into a dict whose key/value pairs are subject/first session. Given that you want to treat the first line for each subject differently, I think your current approach of looping over the list elements is fine.

all_files = [
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-130_S_4817-ses-2018-05-04_14_33_33.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-141_S_0767-ses-2019-04-08_12_52_36.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-041_S_5097-ses-2019-05-07_09_56_14.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-068_S_4061-ses-2017-09-26_14_07_37.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-002_S_1280-ses-2017-03-13_13_38_31.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-082_S_5282-ses-2019-06-17_10_11_15.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-018_S_4399-ses-2019-08-06_13_03_58.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-123_S_0106-ses-2018-10-11_12_54_59.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-141_S_2333-ses-2018-12-26_15_31_55.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-031_S_2018-ses-2019-01-24_11_26_13.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-041_S_0679-ses-2017-07-05_09_46_36.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-037_S_0303-ses-2017-05-11_13_39_46.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-037_S_0454-ses-2017-09-06_09_41_25.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-068_S_2187-ses-2019-10-09_13_19_17.0.txt',
 '/home/xin/Downloads/BrainImaging_UNC/out04_adni_roi_signals2/roi_signals_power264_sub-116_S_4043-ses-2018-03-02_10_03_10.0.txt'
]

import re
unique_subject = {}

for f in all_files:
    groups = re.search('(.*)_sub-(.*)-ses-(.*).txt', f)
    subject = groups.group(2)
    if subject not in unique_subject:
        session = groups.group(3)
        unique_subject[subject] = session

for k, v in unique_subject.items():
    print(f"{k} : {v}")

Output:

130_S_4817 : 2018-05-04_14_33_33.0
141_S_0767 : 2019-04-08_12_52_36.0
041_S_5097 : 2019-05-07_09_56_14.0
068_S_4061 : 2017-09-26_14_07_37.0
002_S_1280 : 2017-03-13_13_38_31.0
082_S_5282 : 2019-06-17_10_11_15.0
018_S_4399 : 2019-08-06_13_03_58.0
123_S_0106 : 2018-10-11_12_54_59.0
141_S_2333 : 2018-12-26_15_31_55.0
031_S_2018 : 2019-01-24_11_26_13.0
041_S_0679 : 2017-07-05_09_46_36.0
037_S_0303 : 2017-05-11_13_39_46.0
037_S_0454 : 2017-09-06_09_41_25.0
068_S_2187 : 2019-10-09_13_19_17.0
116_S_4043 : 2018-03-02_10_03_10.0
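As an aside, the "keep only the first session per subject" step can also be written with dict.setdefault; a sketch using two made-up filenames in the same format:

```python
import re

files = [
    "roi_signals_power264_sub-130_S_4817-ses-2018-05-04.txt",
    "roi_signals_power264_sub-130_S_4817-ses-2019-01-01.txt",  # later session, ignored
]
unique_subject = {}
for f in files:
    m = re.search(r"sub-(.*)-ses-(.*)\.txt", f)
    # setdefault only inserts if the subject is not already present
    unique_subject.setdefault(m.group(1), m.group(2))
print(unique_subject)  # {'130_S_4817': '2018-05-04'}
```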
3 years ago
Replied to a topic created by constantstranger » In Python, how to split string-typed IDs based on where the integers sit within them?

Here is an answer that does the following:

  • parses the IDs using Python regular expression syntax (handling cases with and without hyphens, and adjustable to accommodate other quirks of historical IDs if needed)
  • puts the IDs into a canonical format
  • adds columns for the ID components
  • sorts based on the ID components so that rows are "grouped" together (though not in the pandas groupby sense)

import re
import pandas as pd

df = pd.DataFrame({'BinLocation':['A0233B21', 'A02033B21', 'A02-033-B21', 'A02-33-B21', 'A02-33-B15', 'A02-30-B21', 'A01-33-B21']})
print(df)
print()
df['RawBinLocation'] = df['BinLocation']

def parse(s):
    # letter, 2 digits, optional hyphen, digits, optional hyphen, letter, 2 digits
    m = re.match('^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$', s)
    if not m:
        return None
    colChar, colInt, rowInt, levelChar, levelInt = m.group(1), int(m.group(2)), int(m.group(3)), m.group(4), int(m.group(5))
    return pd.Series((colChar, colInt, rowInt, levelChar, levelInt))

df[['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt']] = df['BinLocation'].apply(parse)
# Rebuild BinLocation in a canonical zero-padded format
df['BinLocation'] = df.apply(lambda x: f"{x.ColChar}{x.ColInt:02}-{x.RowInt:03}-{x.LevChar}{x.LevInt:02}", axis=1)
df.sort_values(by=['ColChar', 'ColInt', 'RowInt', 'LevChar', 'LevInt'], inplace=True, ignore_index=True)
print(df)

Output:

   BinLocation
0     A0233B21
1    A02033B21
2  A02-033-B21
3   A02-33-B21
4   A02-33-B15
5   A02-30-B21
6   A01-33-B21

   BinLocation RawBinLocation ColChar  ColInt  RowInt LevChar  LevInt
0  A01-033-B21     A01-33-B21       A       1      33       B      21
1  A02-030-B21     A02-30-B21       A       2      30       B      21
2  A02-033-B15     A02-33-B15       A       2      33       B      15
3  A02-033-B21       A0233B21       A       2      33       B      21
4  A02-033-B21      A02033B21       A       2      33       B      21
5  A02-033-B21    A02-033-B21       A       2      33       B      21
6  A02-033-B21     A02-33-B21       A       2      33       B      21
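To see what the parsing regex captures on a single ID, a quick standalone check (same pattern as in parse above):

```python
import re

# The optional hyphens (-?) let the pattern handle both hyphenated and compact IDs
m = re.match(r"^([A-Z])([0-9]{2})-?([0-9]+)-?([A-Z])([0-9]{2})$", "A0233B21")
print(m.groups())  # ('A', '02', '33', 'B', '21')
```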

One approach is to use iterators. See iter() and next(), two Python built-in functions documented here and here, for some background.
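As a minimal illustration of the two built-ins before the full example (passing a second argument to next() makes it return that default instead of raising StopIteration):

```python
d = {"key1": 1, "key2": 2}
it = iter(d)            # iterating a dict yields its keys, in insertion order
print(next(it, None))   # 'key1'
print(next(it, None))   # 'key2'
print(next(it, None))   # None -- iterator exhausted, default returned
```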

dict1 = {
   'key1': 1,
   'key2': 2,
   'key3': 3
}

dict2 = {
   'key1': 0,
   'key2': 2,
   'key3': 4
}

it1 = iter(dict1)
it2 = iter(dict2)
v1 = next(it1, None)
v2 = next(it2, None)
count = 0
while v1 is not None or v2 is not None:
    print(f"v1, dict1[v1] {v1, dict1[v1] if v1 is not None else 'na'}, v2, dict2[v2] {v2, dict2[v2] if v2 is not None else 'na'}")
    if v1 is not None and dict1[v1] == 0:
        v1 = next(it1, None)
    if v2 is not None and dict2[v2] == 0:
        v2 = next(it2, None)
    # simulate dice rolling and decrement mentioned in the question
    count += 1
    if count == 3:
        dict1['key1'] = 0
    elif count == 6:
        dict2['key2'] = 0
    elif count == 9:
        dict1['key2'] = 0
        dict2['key3'] = 0
    elif count == 12:
        dict1['key3'] = 0
print(f"v1, dict1[v1] {v1, dict1[v1] if v1 is not None else 'na'}, v2, dict2[v2] {v2, dict2[v2] if v2 is not None else 'na'}")

Output:

v1, dict1[v1] ('key1', 1), v2, dict2[v2] ('key1', 0)
v1, dict1[v1] ('key1', 1), v2, dict2[v2] ('key2', 2)
v1, dict1[v1] ('key1', 1), v2, dict2[v2] ('key2', 2)
v1, dict1[v1] ('key1', 0), v2, dict2[v2] ('key2', 2)
v1, dict1[v1] ('key2', 2), v2, dict2[v2] ('key2', 2)
v1, dict1[v1] ('key2', 2), v2, dict2[v2] ('key2', 2)
v1, dict1[v1] ('key2', 2), v2, dict2[v2] ('key2', 0)
v1, dict1[v1] ('key2', 2), v2, dict2[v2] ('key3', 4)
v1, dict1[v1] ('key2', 2), v2, dict2[v2] ('key3', 4)
v1, dict1[v1] ('key2', 0), v2, dict2[v2] ('key3', 0)
v1, dict1[v1] ('key3', 3), v2, dict2[v2] (None, 'na')
v1, dict1[v1] ('key3', 3), v2, dict2[v2] (None, 'na')
v1, dict1[v1] ('key3', 0), v2, dict2[v2] (None, 'na')
v1, dict1[v1] (None, 'na'), v2, dict2[v2] (None, 'na')
3 years ago
Replied to a topic created by constantstranger » Filtering a DataFrame by user input for names containing specific characters (Python)

First, to address your error message: the contains() method requires a string as its first argument, not a list.

The string it requires is a character sequence or regular expression (see here) that it will attempt to match, which I believe is different from what you are attempting, namely finding Name rows that contain all of the input letters.
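To illustrate the distinction with a small made-up Series: str.contains takes one pattern string (a regex by default), and a regex with lookaheads is one alternative way to express "contains all of these letters":

```python
import pandas as pd

s = pd.Series(["aerial", "tom", "anna"])
# One pattern string, not a list:
print(s.str.contains("a").tolist())               # [True, False, True]
# Lookaheads require both 'a' and 'i' somewhere in the string:
print(s.str.contains("(?=.*a)(?=.*i)").tolist())  # [True, False, False]
```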

To do that, you could use an approach like the following:

import pandas as pd
data = {'Name': ['Aerial', 'Tom', 'Amie', 'Anuj'],
        'Age': [27, 24, 22, 32],
        'Address': ['pennsylvania', 'newyork', 'newjersey', 'delaware'],
        'Qualification': ['Msc', 'MA', 'MCA', 'Phd']}
df = pd.DataFrame(data)
df["Name"] = df["Name"].str.lower()
#letters_in = input('Words in Name Column that contain these letters: \n ').split()
letters_in = ['a', 'i']
new_output = df[df.apply(lambda x: all(letter in x['Name'] for letter in letters_in), axis=1)]
print(new_output)

Output:

     Name  Age       Address Qualification
0  aerial   27  pennsylvania           Msc
2    amie   22     newjersey           MCA
3 years ago
Replied to a topic created by constantstranger » How to merge for-loop output DataFrames into one using Python?

Here is an answer that gives results in a slightly different form than framed in the question, but which uses the values of 'A' and 'B' as the index and columns of the DataFrame result, which may be more descriptive of the end result:

import pandas as pd

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)

Output:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0
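The inner expression in the comprehension is just a per-character difference count (a Hamming distance) between two equal-length strings; a minimal illustration:

```python
# Count positions where two equal-length strings differ
a, b = "AC", "AA"
print(sum(c != d for c, d in zip(a, b)))  # 1 -- only the second character differs
```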

Here is a timing comparison of the general-matrix approach above against the numpy approach shown in another answer, which uses hardcoded column names:

import pandas as pd
import numpy as np

lists = {'A' : ['AA', 'BB', 'CC'], 'B' : ['AC', 'BC', 'CC']}
df = pd.DataFrame(data=[[sum(c != d for c, d in zip(lists['B'][i], lists['A'][j])) for j in range(len(lists['A']))] for i in range(len(lists['B']))], index=lists['B'], columns=lists['A'])
print(df)


dfa = pd.DataFrame(['AA', 'BB', 'CC'], columns=list('A'))
dfb = pd.DataFrame(['AC', 'BC', 'CC'], columns=list('B'))

def foo(dfa, dfb):
    df = pd.DataFrame(data=[[sum(c != d for c, d in zip(dfb['B'][i], dfa['A'][j])) for j in range(len(dfa['A']))] for i in range(len(dfb['B']))], index=dfb['B'], columns=dfa['A'])
    return df
    


def bar(dfa, dfb):
    a = np.array(dfa['A'].str.split('').str[1:-1].tolist())
    b = np.array(dfb['B'].str.split('').str[1:-1].tolist())
    dfb[['disB_1', 'disB_2', 'disB_3']] = (a != b[:, None]).sum(axis=2)
    return dfb

import timeit

print("\nGeneral matrix approach:")
t = timeit.timeit(lambda: foo(dfa, dfb), number = 100)
print(f"timeit: {t}")

print("\nHardcoded columns approach:")
t = timeit.timeit(lambda: bar(dfa, dfb), number = 100)
print(f"timeit: {t}")

Results via timeit:

    AA  BB  CC
AC   1   2   1
BC   2   1   1
CC   2   2   0

General matrix approach:
timeit: 0.023536499997135252

Hardcoded columns approach:
timeit: 0.03922149998834357

This seems to indicate that the numpy approach takes roughly 1.5-2x as long as the general-matrix approach in this answer.