df中使用cut进行分bin,获得对应的bin值importlib.pythonrc文件在打开python console时自动执行def solve(r1, r2):
     # sort the two ranges such that the range with smaller first element
     # is assigned to x and the bigger one is assigned to y
     x, y = sorted((r1, r2))
     #now if x[1] lies between x[0] and y[0](x[1] != y[0] but can be equal to x[0])
     #then the ranges are not overlapping and return the differnce of y[0] and x[1]
     #otherwise return 0 
     if x[0] <= x[1] < y[0] and all( y[0] <= y[1] for y in (r1,r2)):
        return y[0] - x[1]
     return 0
... 
>>> solve([0,10],[12,20])
2
>>> solve([5,10],[1,5])
0
>>> solve([5,10],[1,4])
1
# the specific order
sorter = ['a', 'c', 'b']
df['column'] = df['column'].astype("category")
df['column'].cat.set_categories(sorter, inplace=True)
df.sort_values(["column"], inplace=True)
A free Python/R notebook can also be created online at https://rdrr.io/.
t = pd.DataFrame({1:[1,2,3], 2:[3,4,5], 3:[6,7,8]})
t
	1	2	3
0	1	3	6
1	2	4	7
2	3	5	8
# by row sum
t.div(t.sum(axis=1), axis=0)
	1	2	3
0	0.100000	0.300000	0.600000
1	0.153846	0.307692	0.538462
2	0.187500	0.312500	0.500000
# by column sum
t.div(t.sum(axis=0), axis=1)
	1	2	3
0	0.166667	0.250000	0.285714
1	0.333333	0.333333	0.333333
2	0.500000	0.416667	0.380952
def dfs_tabs(df_list, sheet_list, file_name):
    writer = pd.ExcelWriter(file_name,engine='xlsxwriter')   
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0 , startcol=0, index=False)   
    writer.save()
As discussed here:
df = pd.DataFrame({'favcount':[1,2,3], 'sn':['a','b','c']})
print (df)
   favcount sn
0         1  a
1         2  b
2         3  c
print (df.favcount.idxmax())
2
print (df.ix[df.favcount.idxmax()])
favcount    3
sn          c
Name: 2, dtype: object
print (df.ix[df.favcount.idxmax(), 'sn'])
c
主要参考这篇文章:Changing the sans-serif font to Helvetica。转换好的字体文件放在了这里,可下载使用。
# 在mac上找到Helvetica字体
$ ls /System/Library/Fonts/Helvetica.ttc
# 复制到其他的位置
$ cp /System/Library/Fonts/Helvetica.ttc ~/Desktop
# 使用online的工具转换为.tff文件
# 这里使用的是: https://www.files-conversion.com/font/ttc
# 定位python库的字体文件
$ python -c 'import matplotlib ; print(matplotlib.matplotlib_fname())'
/Users/gongjing/usr/anaconda2/lib/python2.7/site-packages/matplotlib/mpl-data/matplotlibrc
# 将tff文件放到上述路径的font目录下
$ cp Helvetica.ttf /Users/gongjing/usr/anaconda2/lib/python2.7/site-packages/matplotlib/mpl-data/fonts/ttf
# 修改matplotlibrc文件
#font.sans-serif : DejaVu Sans, Bitstream Vera Sans, Computer Modern Sans Serif, Lucida Grande, Verdana, Geneva, Helvetica, Lucid, Arial, Avant Garde, sans-serif
=》font.sans-serif : Helvetica, DejaVu Sans, Bitstream Vera Sans, Computer Modern Sans Serif, Lucida Grande, Verdana, Geneva, Lucid, Arial, Avant Garde, sans-serif
# 重启jupyter即可
上面是设置全局的,也可以显示的在代码中指定,可以参考这里:
# 显示指定在此脚本中用某个字体
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Helvetica"
# 对于不同的部分(标题、刻度等)指定不同的字体
csfont = {'fontname':'Comic Sans MS'}
hfont = {'fontname':'Helvetica'}
plt.title('title',**csfont)
plt.xlabel('xlabel', **hfont)
plt.show()
画图时,如果需要使用中文label,需要设置,主要参考这里:
SimHei黑体字体文件.tff文件放到matplotlib包的路径下,路径为:matplotlib/mpl-data/fonts/ttf,可以使用pip show matplotlib查看包安装的位置matplotlibrc,一般在matplotlib/mpl-data/这个下面。
    /Users/gongjing/.matplotlib下面缓存的字体文件plt.rcParams["font.family"] = "SimHei"可以使用numpy的函数nan_to_num:numpy.nan_to_num(x, copy=True)]
x = np.array([np.inf, -np.inf, np.nan, -128, 128])
# 默认copy=True,不改变原来数组的值
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])
        
# 设置copy=False,原来数组的值会被替换
np.nan_to_num(x, copy=False)
As discussed here:
import os
def check_dir_or_make(d):
    if not os.path.exists(d):
        os.makedirs(d)
As discussed here:
df = pd.DataFrame({'Name': ['Steve_Smith', 'Joe_Nadal', 
                           'Roger_Federer'],
                 'Age':[32,34,36]})
                 
# Age	Name
# 0	32	Steve_Smith
# 1	34	Joe_Nadal
# 2	36	Roger_Federer
df[['First','Last']] = df.Name.str.split("_",expand=True,)
# expand需要设置为True,负责报错说原来df没有“first”,“last”列
# Age	Name	First	Last
# 0	32	Steve_Smith	Steve	Smith
# 1	34	Joe_Nadal	Joe	Nadal
# 2	36	Roger_Federer	Roger	Federer
df中使用cut进行分bin,获得对应的bin值# 将数据分成10组
bins = 10
df = pd.DataFrame.from_dict({'value':[i/10 for i in range(10+1)]})
df['bins'] = pd.cut(df['value'], bins=bins)
df
value	bins
0	0.0	(-0.001, 0.1]
1	0.1	(-0.001, 0.1]
2	0.2	(0.1, 0.2]
3	0.3	(0.2, 0.3]
4	0.4	(0.3, 0.4]
5	0.5	(0.4, 0.5]
6	0.6	(0.5, 0.6]
7	0.7	(0.6, 0.7]
8	0.8	(0.7, 0.8]
9	0.9	(0.8, 0.9]
10	1.0	(0.9, 1.0]
# 通过以value_counts()的index获得唯一的bin
# 打印:此时每一个i是interval对象
for i in list(df['bins'].value_counts().index):
    print(i)
(-0.001, 0.1]
(0.9, 1.0]
(0.8, 0.9]
(0.7, 0.8]
(0.6, 0.7]
(0.5, 0.6]
(0.4, 0.5]
(0.3, 0.4]
(0.2, 0.3]
(0.1, 0.2]
# interval对象不能通过index进行取值
i[0]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-46-3aa51af8ff05> in <module>()
----> 1 i[0]
TypeError: 'pandas._libs.interval.Interval' object does not support indexing
# interval对象有特定的属性进行取值等操作
# closed, left, right, closed_left, closed_right, mid, open_left, open_right
i.left
0.1
i.right
0.2
国内有一些镜像,在安装时使用这些镜像会加快下载的速度,可参考这里。
临时修改(安装时指定):
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
永久修改(安装源写入配置文件~/.pip/pip.conf):
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
参考Efficient method to calculate the rank vector of a list in Python:
import scipy.stats as ss
ss.rankdata([3, 1, 4, 15, 92])
# array([ 2.,  1.,  3.,  4.,  5.])
ss.rankdata([1, 2, 3, 3, 3, 4, 5])
# array([ 1.,  2.,  4.,  4.,  4.,  6.,  7.])
参考这里:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
import time
start =time.clock()
#中间写上代码块
end = time.clock()
print('Running time: %s Seconds'%(end-start))
参考这里:
In [326]:
df = pd.DataFrame({'id':['a','a','b','c','c'], 'words':['asd','rtr','s','rrtttt','dsfd']})
df
Out[326]:
  id   words
0  a     asd
1  a     rtr
2  b       s
3  c  rrtttt
4  c    dsfd
In [327]:
df.groupby('id')['words'].apply(','.join)
Out[327]:
id
a        asd,rtr
b              s
c    rrtttt,dsfd
Name: words, dtype: object
# 注意,这里是有两行,所以以id进行group之后,只剩下word
# groupby之后是得到一个df
# groupby()[col]是选取对应的column,但是选出的column是series,不是直接的list
df.groupby('photo_id')['like_flag'].apply(lambda x: np.cumsum(list(x))).to_dict()
参考这里:
import numpy as np
a = [4,6,12]
np.cumsum(a)
#array([4, 10, 22])
参考这里:
   num
0    1
1    6
2    4
3    5
4    2
input = 3
# 这里是选取的最接近的前2个,控制index可选择1个等
df.iloc[(df['num']-input).abs().argsort()[:2]]
   num
2    4
4    2
importlib构建文件目录:
gongjing@bjzyx-c451:~/gj_py_func $ pwd
/home/gongjing/gj_py_func
gongjing@bjzyx-c451:~/gj_py_func $ lst
.
  |-list.py
  |-file.py
  |-dataframe.py
  |-init.py
  |-__pycache__
  |  |-list.cpython-37.pyc
  |  |-file.cpython-37.pyc
  |  |-dataframe.cpython-37.pyc
  |  |-load_packages.cpython-37.pyc
  |-.ipynb_checkpoints
  |  |-file-checkpoint.py
  |  |-dataframe-checkpoint.py
  |  |-list-checkpoint.py
调用:
import importlib, sys
if '/home/gongjing/' not in sys.path: sys.path.append('/home/gongjing/')
func_df = importlib.import_module('.dataframe', package='gj_py_func')
func_file = importlib.import_module('.file', package='gj_py_func')
func_ls = importlib.import_module('.list', package='gj_py_func')
importlib.reload(func_df)
importlib.reload(func_file)
importlib.reload(func_ls)
# 查看模块信息,包含哪些函数
help(func_df)
Help on module gj_py_func.dataframe in gj_py_func:
NAME
    gj_py_func.dataframe
FUNCTIONS
    df_col_missing_pct(df)
    
    df_col_sum(df)
    
    df_norm_by_colsum(df)
    
    df_norm_by_rowsum(df)
    
    df_row_sum(df)
    
    load_data(fn, col_ls=None)
FILE
    /home/gongjing/gj_py_func/dataframe.py
参考这里:
df.fillna({1:0}, inplace=True)
df[1].fillna(0, inplace=True)
参考这里:
if item:
        JD = item.group()
        csvwriter.writerow(JD)
# J,D,",", ,C,o,l,u,m,b,i,a, ,L,a,w, ,S,c,h,o,o,l,....
# one string per row
csvwriter.writerow([JD])
.pythonrc文件在打开python console时自动执行参考这里:
1, 写.pythonrc文件
print("how_use_pythonstartup")
2, 设置PYTHONSTARTUP变量
export PYTHONSTARTUP=~/.pythonrc
3, 打开新的console
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
how_use_pythonstartup
>>>