ssh
免密码登录# rename 's/old-name/new-name/' files
gongjing@hekekedeiMac ..ject/meme_img % ll
total 1.3M
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 logoRBNS_A1CF.eps
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 logoRBNS_BOLL.eps
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 logoRBNS_CELF1.eps
gongjing@hekekedeiMac ..ject/meme_img % rename 's/logoRBNS_//' *eps
gongjing@hekekedeiMac ..ject/meme_img % ll
total 1.3M
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 A1CF.eps
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 BOLL.eps
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 CELF1.eps
gongjing@hekekedeiMac ..ject/meme_img % rename 's/.eps//' *eps
gongjing@hekekedeiMac ..ject/meme_img % ll
total 1.3M
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 A1CF
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 BOLL
-rw-r--r-- 1 gongjing staff 14K Jun 19 14:47 CELF1
有的时候rename
命令不正常,可能是因为使用的命令来源不同,就像这里讨论的:
首先要查看环境的rename
是来源,然后写对应的命令,不能弄错了:
$ rename --version
/usr/bin/rename using File::Rename version 0.20
$ rename 's/\.jpeg$/.jpg/' *
$ rename --version
rename from util-linux 2.30.2
$ rename .jpeg .jpg *
# 集群上使用的是util-linux这个版本
# 命令:rename pattern replace files
$ rename --version
rename (util-linux-ng 2.17.2)
### use cut -f and --complement
[zhangqf5@loginview02 RBPgroup_CLIPseq_hg38_anno]$ head test.txt|awk -F "\t" '{print NF}'
31
31
31
31
31
31
31
31
31
31
[zhangqf5@loginview02 RBPgroup_CLIPseq_hg38_anno]$ head test.txt|awk 'BEGIN{OFS="\t";}{if($17==10)print}'|cut -f17 --complement|awk -F "\t" '{print NF}'
30
30
30
30
30
30
30
30
# https://theunarchiver.com/command-line
$ wget https://cdn.theunarchiver.com/downloads/unarMac.zip
$ ll
-rwxr-xr-x 1 gongjing staff 1.8M May 19 2016 lsar
-rwxr-xr-x 1 gongjing staff 1.8M May 19 2016 unar
$ unar
unar v1.10.1 (May 19 2016), a tool for extracting the contents of archive files.
Usage: unar [options] archive [files ...]
Available options:
-output-directory (-o) <string> The directory to write the contents of the archive to. Defaults to the current directory. If set to a single dash (-), no files will be
created, and all data will be output to stdout.
-force-overwrite (-f) Always overwrite files when a file to be unpacked already exists on disk. By default, the program asks the user if possible, otherwise skips
the file.
-force-rename (-r) Always rename files when a file to be unpacked already exists on disk.
-force-skip (-s) Always skip files when a file to be unpacked already exists on disk.
-force-directory (-d) Always create a containing directory for the contents of the unpacked archive. By default, a directory is created if there is more than one
top-level file or folder.
-no-directory (-D) Never create a containing directory for the contents of the unpacked archive.
-password (-p) <string> The password to use for decrypting protected archives.
-encoding (-e) <encoding name> The encoding to use for filenames in the archive, when it is not known. If not specified, the program attempts to auto-detect the encoding
used. Use "help" or "list" as the argument to give a listing of all supported encodings.
-password-encoding (-E) <name> The encoding to use for the password for the archive, when it is not known. If not specified, then either the encoding given by the -encoding
option or the auto-detected encoding is used.
-indexes (-i) Instead of specifying the files to unpack as filenames or wildcard patterns, specify them as indexes, as output by lsar.
-no-recursion (-nr) Do not attempt to extract archives contained in other archives. For instance, when unpacking a .tar.gz file, only unpack the .gz file and not
its contents.
-copy-time (-t) Copy the file modification time from the archive file to the containing directory, if one is created.
-no-quarantine (-nq) Do not copy Finder quarantine metadata from the archive to the extracted files.
-forks (-k) <fork|visible|hidden|skip> How to handle Mac OS resource forks. "fork" creates regular resource forks, "visible" creates AppleDouble files with the extension ".rsrc",
"hidden" creates AppleDouble files with the prefix "._", and "skip" discards all resource forks. Defaults to "fork".
-quiet (-q) Run in quiet mode.
-version (-v) Print version and exit.
-help (-h) Display this information.
#!/bin/bash
array=( "Vietnam" "Germany" "Argentina" )
array2=( "Asia" "Europe" "America" )
for ((i=0;i<${#array[@]};++i)); do
printf "%s is in %s\n" "${array[i]}" "${array2[i]}"
done
Run binary command which depends on GLIBC_2.14
and GLIBC_2.17
:
[zhangqf5@loginview02 bin]$ ./clan_annotate
./clan_annotate: /lib64/libc.so.6: version `GLIBC_2.14' not found (required by ./clan_annotate)
./clan_annotate: /lib64/libc.so.6: version `GLIBC_2.17' not found (required by ./clan_annotate)
Check GCC version:
[zhangqf5@loginview02 bin]$ ldd --version
ldd (GNU libc) 2.12
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
# print lines of files matched with specific pattern
[zhangqf7@loginview02 bwa]$ wl /Share/home/zhangqf7/jinsong_zhang/zebrafish/data/iclip/20181224/Rawdata/shi-zi-*/bwa/CTK_Procedure{1,2,3,4}/CITS/iCLIP.tag*p05.bed|awk '{print $1}'
15450
11876
26860
19994
69867
174233
13396
10204
24288
18042
65161
161563
2059
1369
1983
1569
6291
13434
2329
1535
2171
1758
6762
14614
666808
[zhangqf7@loginview02 bwa]$ wl /Share/home/zhangqf7/jinsong_zhang/zebrafish/data/iclip/20181224/Rawdata/shi-zi-*/bwa/CTK_Procedure{1,2,3,4}/CITS/iCLIP.tag*p05.bed|awk '{print $1}'|xargs -n4 -d'\n'
15450 11876 26860 19994
69867 174233 13396 10204
24288 18042 65161 161563
2059 1369 1983 1569
6291 13434 2329 1535
2171 1758 6762 14614
666808
Use option -sfn
of ln
command as discussed here:
ln -sfn {path/to/file-name} {link-name}
./configure --prefix=/somewhere/else/than/usr/local
make
make install
# get full path
work_space=$(pwd)
# get dir name instead of full path
work_dir_name=${PWD##*/}
参考这里:
# 这里第一个参数是GPU ID,第2个及之后的参数是要训练的文件夹名称(遍历训练)
for i in "${@:2}"
do
echo "process: "$i", with GPU: "$1
cd $expr/$i
bash train.sh $1
done
# $1: 第一个参数
# "${@:2}": 第二个及之后的所有参数
# "$@": 获得所有参数
# "${@:3:4}": 第3个参数开始的4个参数,即$3,$4,$5,$6
### https://zhuanlan.zhihu.com/p/40444240
# 检查系统含有的GLIBC版本
$ strings /lib64/libc.so.6 |grep GLIBC
GLIBC_2.2.5
GLIBC_2.2.6
GLIBC_2.3
GLIBC_2.3.2
GLIBC_2.3.3
GLIBC_2.3.4
GLIBC_2.4
GLIBC_2.5
GLIBC_2.6
GLIBC_2.7
GLIBC_2.8
GLIBC_2.9
GLIBC_2.10
GLIBC_2.11
GLIBC_2.12
GLIBC_PRIVATE
# 下载、安装
$ wget https://ftp.gnu.org/gnu/glibc/glibc-2.14.tar.gz
$ tar zxf glibc-2.14.tar.gz
$ cd glibc-2.14
$ mkdir build
$ cd build
$ ../configure --prefix=/opt/glibc-2.14
$ make -j4
$ make install
# 添加到LD_LIBRARYPATH
$ export LD_LIBRARY_PATH=/opt/glibc-2.14/lib:$LD_LIBRARY_PATH
ssh
免密码登录网上教程很多,随便参考,比如这里:
在自己的Linux系统上生成SSH密钥和公钥
➜ ~ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/gongjing/.ssh/id_rsa):
/home/gongjing/.ssh/id_rsa already exists.
Overwrite (y/n)? y
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/gongjing/.ssh/id_rsa.
Your public key has been saved in /home/gongjing/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:RWPGrhUHIbbp9DvzJX2cUOrzyFyibEdd8qU/7Mi5KUU gongjing@omnisky
The key's randomart image is:
+---[RSA 2048]----+
| o.Bo |
| . Bo.. |
| +..o . |
| o oo E+ o|
| So. .o.+o|
| . ..+ooo|
| + ++++o|
| .*=+X+.|
| .o+Xoo.|
+----[SHA256]-----+
将SSH公钥上传到Linux服务器(写在对方的~/.ssh/authorized_keys
中)
➜ ~ ssh-copy-id gongjing@10.10.91.12
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/gongjing/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
gongjing@10.10.91.12's password:
Number of key(s) added: 1
Now try logging into the machine, with: "ssh 'gongjing@10.10.91.12'"
and check to make sure that only the key(s) you wanted were added.
如果id_rsa
之前是其他用户(比如root)创建的,有可能会出现权限错误:
➜ ~ ssh gongjing@10.10.91.12
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0755 for '/home/gongjing/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/home/gongjing/.ssh/id_rsa": bad permissions
gongjing@10.10.91.12's password:
此时可以更改权限,使得文件id_rsa
只有自己可读写(参考这里):
chmod 600 ~/.ssh/id_rsa
使用方法可参考这里:
-a
:迭代同步,子目录、文件链接等都会同步-P
:显示同步进度,速率、已完成、剩余等# 同步文件夹
rsync -aP download_20191204 gongjing@10.10.91.12:/home/gongjing/
# 同步文件
rsync -aP ./Untitled.ipynb gongjing@10.10.91.12:/home/gongjing/scripts
参考这里:
if [ "$color_prompt" = yes ]; then
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
else
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
fi
unset color_prompt force_color_prompt
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;33;1m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\] \$ '
# User specific aliases and functions
alias vb='vi ~/.bashrc'
alias sb='source ~/.bashrc'
alias les='less -S'
alias wl='wc -l'
alias lt='ls -lht'
alias ll='ls -lh'
描述性统计量:
术语:
Here is the note of reading Think Stats. The Chinese version online is here
术语:
This resource compare the basic usage in both languages, which is pretty direct.
函数:gsub(pattern, replacement, x)
,例子如下:
group <- c("12357e", "12575e", "197e18", "e18947")
group
[1] "12357e" "12575e" "197e18" "e18947"
gsub("e", "", group)
[1] "12357" "12575" "19718" "18947"
函数:commandArgs
,例子如下:
args <- commandArgs(trailingOnly = TRUE)
print(args[1]) # args[0] -> script anme
函数:startsWith
,endsWith
,例子如下:
> startsWith("what", "wha")
[1] TRUE
> startsWith("what", "ha")
[1] FALSE
函数:read.table
的参数check.names
默认对于特殊字符是要进行转换的,保持的例子如下:
read.csv(file, sep=",", header=T, check.names = FALSE)
args<-commandArgs(TRUE)
library(RUVSeq)
# f='Control_MO_H4_H6.txt'
f=args[1]
zfGenes = read.table(f, header=T, row.names=1, sep='\t')
filter <- apply(zfGenes, 1, function(x) length(x[x>5])>=2)
filtered <- zfGenes[filter,]
genes <- rownames(filtered)[grep("^N", rownames(filtered))]
spikes <- rownames(filtered)[grep("^ERCC", rownames(filtered))]
condition1=args[2]
condition2=args[3]
x <- as.factor(rep(c(condition1, condition2), each=2))
print(x)
set <- newSeqExpressionSet(as.matrix(filtered), phenoData = data.frame(x, row.names=colnames(filtered)))
head(counts(set))
set1 <- RUVg(set, spikes, k=1)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts(set1), colData = pData(set1), design = ~ W_1 + x) # use spike-in adjust
# dds <- DESeqDataSetFromMatrix(countData = counts(set), colData = pData(set), design = ~x) # use not adjust
dds <- DESeq(dds)
res <- results(dds)
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Transcript"
head(resdata)
resdata = resdata[resdata$pvalue>=0,]
## Write results
savefn=args[4]
write.csv(resdata, file=savefn)
.Rdata
数据> my = load('/Share2/home/zhangqf5/gongjing/rare_cell/tools/CellSIUS/Codes_Data/input/Supplementary_File_8_sce_raw.RData')
# 查看加载的变量
> my
[1] "sce"
# 不是list形式调用
> my$sce
Error in my$sce : $ operator is invalid for atomic vectors
# 通过ls命令查看环境定义的变量
# 核对一下刚加载的是哪个
> ls()
[1] "code_dir" "input_dir" "my" "out_data_dir" "plotdir"
[6] "sce"
# 查看加载的变量具有哪些属性可以使用
> attributes(sce)
$.__classVersion__
R Biobase eSet ExpressionSet SCESet
"3.4.1" "2.36.2" "1.3.0" "1.0.0" "1.1.9"
$logExprsOffset
[1] 1
$lowerDetectionLimit
[1] 0
$cellPairwiseDistances
dist(0)
$featurePairwiseDistances
dist(0)
$reducedDimension
<0 x 0 matrix>
$bootstraps
numeric(0)
$sc3
list()
$featureControlInfo
An object of class 'AnnotatedDataFrame'
rowNames: 1
varLabels: name
varMetadata: labelDescription
因为这个是biology的数据,所以有一些通用的属性,以及取值的方法:
# 使用exprs获取count数据
# 这也是我们一般使用的expression matrix数据
# 行:gene,列:cell id
> countdata<-exprs(sce)
> dim(countdata)
[1] 23848 11680
> head(countdata,n=2)
JRK_DA234_C2782 JRK_DA234_C2783 JRK_DA234_C2784 JRK_DA234_C2785
ENSG00000238009 0 0 0.9093836 0
ENSG00000239945 0 0 0.9093836 0
# fData: 获取feature data,这里是基因id
# 这个matrix相当于是对基因的注释,各种注释信息放进来
> fdata<-fData(sce)
> dim(fdata)
[1] 23848 15
> head(fdata, head=2)
gene_id chr symbol gene_biotype mean_exprs
ENSG00000238009 ENSG00000238009 1 RP11-34P13.7 lincRNA 0.0046305668
ENSG00000239945 ENSG00000239945 1 RP11-34P13.8 lincRNA 0.0004008153
ENSG00000279457 ENSG00000279457 1 FO538757.2 protein_coding 0.1828388937
ENSG00000228463 ENSG00000228463 1 AP006222.2 lincRNA 0.2726365525
ENSG00000236601 ENSG00000236601 1 RP4-669L17.2 lincRNA 0.0006401731
ENSG00000237094 ENSG00000237094 1 RP4-669L17.10 lincRNA 0.0098135864
exprs_rank n_cells_exprs total_feature_exprs pct_total_exprs
ENSG00000238009 7274 59 54.085020 7.842510e-05
ENSG00000239945 2200 6 4.681523 6.788366e-06
ENSG00000279457 16875 2019 2135.558279 3.096632e-03
ENSG00000228463 18491 2948 3184.394934 4.617480e-03
ENSG00000236601 3144 9 7.477221 1.084222e-05
ENSG00000237094 8776 120 114.622690 1.662068e-04
pct_dropout total_feature_counts log10_total_feature_counts
ENSG00000238009 99.49486 61 1.792392
ENSG00000239945 99.94863 6 0.845098
ENSG00000279457 82.71404 2307 3.363236
ENSG00000228463 74.76027 3724 3.571126
ENSG00000236601 99.92295 9 1.000000
ENSG00000237094 98.97260 121 2.086360
pct_total_counts is_feature_control_MT is_feature_control
ENSG00000238009 2.977997e-05 FALSE FALSE
ENSG00000239945 2.929177e-06 FALSE FALSE
ENSG00000279457 1.126269e-03 FALSE FALSE
ENSG00000228463 1.818042e-03 FALSE FALSE
ENSG00000236601 4.393765e-06 FALSE FALSE
ENSG00000237094 5.907174e-05 FALSE FALSE
> colnames(fdata)
[1] "gene_id" "chr"
[3] "symbol" "gene_biotype"
[5] "mean_exprs" "exprs_rank"
[7] "n_cells_exprs" "total_feature_exprs"
[9] "pct_total_exprs" "pct_dropout"
[11] "total_feature_counts" "log10_total_feature_counts"
[13] "pct_total_counts" "is_feature_control_MT"
[15] "is_feature_control"
# pData: 获取phenotype data,这里是cell id
# 这个matrix相当于是对细胞的注释,各种注释信息放进来
> pdata<-pData(sce)
> dim(pdata)
[1] 11680 38
> head(pdata, head=2)
cell_idx Batch cell_line cell_cycle_phase G1
IMR90_HCT116_C1 IMR90_HCT116_C1 IMR90_HCT116 HCT116 G1 1.000
IMR90_HCT116_C3 IMR90_HCT116_C3 IMR90_HCT116 HCT116 G1 0.972
IMR90_HCT116_C4 IMR90_HCT116_C4 IMR90_HCT116 IMR90 G1 0.999
> colnames(pdata)
[1] "cell_idx"
[2] "Batch"
[3] "cell_line"
[4] "cell_cycle_phase"
[5] "G1"
[6] "S"
[7] "G2M"
[8] "total_counts"
[9] "log10_total_counts"
[10] "filter_on_total_counts"
[11] "total_features"
[12] "log10_total_features"
[13] "filter_on_total_features"
[14] "pct_dropout"
[15] "exprs_feature_controls_MT"
[16] "pct_exprs_feature_controls_MT"
[17] "filter_on_pct_exprs_feature_controls_MT"
[18] "counts_feature_controls_MT"
[19] "pct_counts_feature_controls_MT"
[20] "filter_on_pct_counts_feature_controls_MT"
[21] "n_detected_feature_controls_MT"
[22] "n_detected_feature_controls"
[23] "counts_feature_controls"
[24] "pct_counts_feature_controls"
[25] "filter_on_pct_counts_feature_controls"
[26] "pct_counts_top_50_features"
[27] "pct_counts_top_100_features"
[28] "pct_counts_top_200_features"
[29] "pct_counts_top_500_features"
[30] "pct_counts_top_50_endogenous_features"
[31] "pct_counts_top_100_endogenous_features"
[32] "pct_counts_top_200_endogenous_features"
[33] "pct_counts_top_500_endogenous_features"
[34] "counts_endogenous_features"
[35] "log10_counts_feature_controls_MT"
[36] "log10_counts_feature_controls"
[37] "log10_counts_endogenous_features"
[38] "is_cell_control"
# 查看单个加载的模块的版本
> packageVersion('scater')
[1] ‘1.10.1’
# 查看整个运行环境:加载的包及版本
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)
Matrix products: default
BLAS: /Share/home/zhangqf5/software/R/3.5.1/lib64/R/lib/libRblas.so
LAPACK: /Share/home/zhangqf5/software/R/3.5.1/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] scater_1.10.1 ggplot2_3.1.0
[3] SingleCellExperiment_1.4.1 SummarizedExperiment_1.12.0
[5] DelayedArray_0.8.0 BiocParallel_1.16.6
[7] matrixStats_0.54.0 Biobase_2.28.0
[9] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2
[11] IRanges_2.16.0 S4Vectors_0.20.1
[13] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] beeswarm_0.2.3 tidyselect_0.2.5 xfun_0.4
[4] reshape2_1.4.3 purrr_0.3.2 HDF5Array_1.10.1
[7] lattice_0.20-38 rhdf5_2.26.2 colorspace_1.4-0
[10] htmltools_0.3.6 viridisLite_0.3.0 yaml_2.2.0
[13] rlang_0.3.1 pillar_1.3.1 glue_1.3.0
[16] withr_2.1.2 GenomeInfoDbData_1.2.0 plyr_1.8.4
[19] stringr_1.3.1 zlibbioc_1.14.0 munsell_0.5.0
[22] gtable_0.2.0 evaluate_0.12 knitr_1.21
[25] vipor_0.4.5 Rcpp_1.0.1 scales_1.0.0
[28] backports_1.1.3 XVector_0.22.0 gridExtra_2.3
[31] digest_0.6.18 stringi_1.2.4 dplyr_0.8.0.1
[34] grid_3.5.1 rprojroot_1.3-2 tools_3.5.1
[37] bitops_1.0-6 magrittr_1.5 lazyeval_0.2.1
[40] RCurl_1.95-4.11 tibble_2.0.1 crayon_1.3.4
[43] pkgconfig_2.0.2 Matrix_1.2-16 DelayedMatrixStats_1.4.0
[46] ggbeeswarm_0.6.0 assertthat_0.2.0 rmarkdown_1.10
[49] viridis_0.5.1 Rhdf5lib_1.4.3 R6_2.3.0
[52] compiler_3.5.1
有时候需要指定的bioconductor版本,才能正确的运行程序。因为不同的版本其安装的包的版本不同,如果不同版本差异很大,经常是容易报错的。比如最近遇到的这个:scRNAseq_workflow_benchmark,其强调说的bioconductor
版本是3.5,如果是其他版本,跑出来报错,尤其是通过其安装的包scater
,差异很大,已知运行不成功。
> tools:::.BioC_version_associated_with_R_version()
[1] ‘3.5’
比如可能要安装低版本的包,需要先在bioconductor中找到对应的,然后指定安装url:
# 比如:安装低版本的scater
# 这里没有成功,因为还有几个其他依赖的包安装不成功
install_version('scater', version='1.4.0',repos = "https://bioconductor.org/packages/3.5/bioc")
# 先下载源文件,一般是压缩的
install.packages('/Share2/home/zhangqf5/gongjing/software/mvoutlier_2.0.9.tar.gz', repos = NULL, type="source")
# https://stat.ethz.ch/R-manual/R-devel/library/utils/html/remove.packages.html
remove.packages(pkgs, lib)
在集群上,有的时候自己的目录下面安装了包,但是在系统调用的时候可能没有加载进来,可能是因为libpath中的路径存在问题,导致一些依赖包的顺序不对,所以不能正常的调用:
# 系统路径是在前的
# 直接调用时,依赖的包是系统的R安装的
# 与用户的R版本不一致,出现问题
> .libPaths()
[1] "/Share/home/zhangqf5/R/x86_64-pc-linux-gnu-library/3.3"
[2] "/Share/home/zhangqf/usr/R-3.3.0/lib64/R/library"
[3] "/Share/home/zhangqf/usr/R-3.3.2/lib64/R/library"
[4] "/Share/home/zhangqf5/software/R/3.5.1/lib64/R/library"
> library(mvoutlier)
Loading required package: sgeostat
Error: package or namespace load failed for ‘mvoutlier’:
package ‘rrcov’ was installed by an R version with different internals; it needs to be reinstalled for use with this R version
>
# 更换lib顺序,使得用户的libpath优先调用
> .libPaths("/Share/home/zhangqf5/software/R/3.5.1/lib64/R/library")
>
> library(mvoutlier)
sROC 0.1-2 loaded
>
有时候会遇到这种问题,先导入了module1,module1默认是导入module2(v1)的,接下来要导入module3,module3是依赖于module2(v2)的,因为v2和v1版本不同,所以使得不能正常导入module2。此时可以先把module1 deattach掉,然后专门的导入module2:
比如下面先导入了模块m6ALogisticModel
,其默认导入ggplot2
,接下来导入caret
,但是其依赖的ggplot2
版本不同,所以导入时报错:
> library(caret)
Loading required package: ggplot2
Error in value[[3L]](cond) :
Package ‘ggplot2’ version 3.1.0 cannot be unloaded:
Error in unloadNamespace(package) : namespace ‘ggplot2’ is imported by ‘m6ALogisticModel’ so cannot be unloaded
可以先将m6ALogisticModel
移除(参考这里),再导入:
> detach("package:m6ALogisticModel", unload=TRUE)
> library(caret)
Loading required package: ggplot2
参考这里:
# 提取training df中列名称不为class的其他列
Training.predictor.Gendata = training[,!names(training) %in% c("class")]
机器学习规则(google: Rules of ML):
进行机器学习的基本方法是:
[zhangqf7@loginview02 RIP]$ head readcount_h2.txt
NM_212847 1922.0 3057.0
NM_212844 1552.0 245.0
NM_001004627 1901.0 22507.0
NM_212849 15338.0 2889.0
NM_205638 3326.0 13646.0
library("DESeq2")
countdata <- read.table("/Share/home/zhangqf7/gongjing/zebrafish/data/RIP/readcount_h2.txt", header=FALSE, row.names=1)
colnames(countdata) = c("HuR", "Control")
countdata <- as.matrix(countdata)
condition = factor(c("HuR", "Control"))
(coldata <- data.frame(row.names=colnames(countdata), condition))
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata, design=~condition)
dds <- DESeq(dds)
# Get differential expression results
res <- results(dds)
#table(res$padj<0.05)
## Order by adjusted p-value
res <- res[order(res$padj), ]
## Merge with normalized count data
resdata <- merge(as.data.frame(res), as.data.frame(counts(dds, normalized=TRUE)), by="row.names", sort=FALSE)
names(resdata)[1] <- "Transcript"
head(resdata)
## filter by pvalue or any criteria else
resdata = resdata[resdata$pvalue<0.05,]
## Write results
write.csv(resdata, file="/Share/home/zhangqf7/gongjing/zebrafish/data/RIP/readcount_h2.DE.csv")
Finally the content of result object resdata
, and can be saved to a csv file:
Transcript baseMean log2FoldChange lfcSE stat pvalue
1 NM_212844 1139.1833 2.539801 1.121616 2.264413 0.02354874
2 NM_001004627 9612.8973 -2.147552 1.010351 -2.125550 0.03354078
3 NM_212849 11431.3248 2.497761 1.064955 2.345415 0.01900593
4 NM_001159988 1066.9555 -2.126481 1.068342 -1.990450 0.04654139
5 NM_001159983 1337.8418 -2.272092 1.083839 -2.096337 0.03605233
6 ENSDART00000172192 307.6127 -2.232198 1.097730 -2.033467 0.04200536
padj HuR Control
1 0.3812277 2097.04486 181.3218
2 0.3812277 2568.60972 16657.1849
3 0.3812277 20724.53227 2138.1173
4 0.3812277 268.88655 1865.0245
5 0.3812277 291.85676 2383.8269
6 0.3812277 67.55943 547.6659
Circos can show the connections between different elements, e.g., genomic regions, RNA molecules and so on. Another userful tool tableviewer can be used to display the percentage of each crosslinks among multiple relationships. In RISE database we collected RNA-RNA interactions (RRIs) from various sources, and tableviewer are used for visualization of the landscape of RRIs. Here is the example:
Generally we can read any txt file into pandas data frame, and use function groupby & count to count the entry for each pair of two variables. Here is the saved data frame:
[zhangqf5@loginview02 data]$ cat RRI_union_deduplicates.split_full.type_dis.human.txt
data proteinCoding lncRNA miRNA rRNA snoRNA snRNA tRNA NoncanonicalRNA others
proteinCoding 41863.0 4756.0 183.0 14128.0 115.0 721.0 66.0 3520.0 2762.0
lncRNA 9982.0 1080.0 1984.0 194.0 40.0 91.0 2.0 748.0 482.0
miRNA 35798.0 304.0 36.0 14.0 6.0 23.0 1.0 458.0 117.0
rRNA 1347.0 147.0 3.0 360.0 31.0 15.0 27.0 100.0 83.0
snoRNA 105.0 21.0 2.0 290.0 299.0 86.0 6.0 24.0 10.0
snRNA 969.0 170.0 1.0 41.0 29.0 851.0 1.0 70.0 52.0
tRNA 22.0 1.0 0.0 3.0 1.0 1.0 210.0 3.0 23.0
NoncanonicalRNA 6010.0 705.0 1076.0 69.0 104.0 69.0 1.0 1132.0 271.0
others 2850.0 384.0 328.0 74.0 12.0 43.0 5.0 262.0 400.0
Then we can add header (more details can be found here) to define color for each element (RNA type here):
data 1 2 3 4 5 6 7 8 9
data 202,75,78 83,169,102 205,185,111 98,180,208 129,112,182 238,130,238 255,140,0 74,113,178 169,169,169
Once concatenate header and data frame, the combined data is ready for plot:
[zhangqf5@loginview02 data]$ cat RRI_union_deduplicates.split_full.type_dis.human.txt
data 1 2 3 4 5 6 7 8 9
data 202,75,78 83,169,102 205,185,111 98,180,208 129,112,182 238,130,238 255,140,0 74,113,178 169,169,169
data proteinCoding lncRNA miRNA rRNA snoRNA snRNA tRNA NoncanonicalRNA others
proteinCoding 41863.0 4756.0 183.0 14128.0 115.0 721.0 66.0 3520.0 2762.0
lncRNA 9982.0 1080.0 1984.0 194.0 40.0 91.0 2.0 748.0 482.0
miRNA 35798.0 304.0 36.0 14.0 6.0 23.0 1.0 458.0 117.0
rRNA 1347.0 147.0 3.0 360.0 31.0 15.0 27.0 100.0 83.0
snoRNA 105.0 21.0 2.0 290.0 299.0 86.0 6.0 24.0 10.0
snRNA 969.0 170.0 1.0 41.0 29.0 851.0 1.0 70.0 52.0
tRNA 22.0 1.0 0.0 3.0 1.0 1.0 210.0 3.0 23.0
NoncanonicalRNA 6010.0 705.0 1076.0 69.0 104.0 69.0 1.0 1132.0 271.0
others 2850.0 384.0 328.0 74.0 12.0 43.0 5.0 262.0 400.0
I wrote a python script to call tableviewer:
import subprocess, os
import sys
def tableview(txt=None, parse_table=None, make_conf=None, conf_dir=None, circos_conf=None, save_dir=None, parsed_conf=None):
if parse_table is None:
parse_table = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer/bin/parse-table'
if make_conf is None:
make_conf = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer/bin/make-conf'
if conf_dir is None:
conf_dir = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer2/data'
if circos_conf is None:
circos_conf = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer2/etc/circos.conf'
if save_dir is None:
save_dir = os.path.dirname(txt)
if parsed_conf is None:
#parsed_conf = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer/samples/parse-table-02a.conf'
parsed_conf = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer2/etc/parse-table.conf'
print "[tableview start] file: %s"%(txt)
subprocess.call(["cat {txt} | {parse_table} -conf {parsed_conf} -segment_order=ascii,size_desc -placement_order=row,col -interpolate_type count -col_order_row -use_col_order_row -col_color_row -use_col_color_row -ribbon_layer_order=size_asc | {make_conf} -dir {conf_dir}".format(txt=txt, parse_table=parse_table, parsed_conf=parsed_conf, make_conf=make_conf, conf_dir=conf_dir)],shell=True)
#subprocess.call(["cat {txt} | {parse_table} -conf {parsed_conf} | {make_conf} -dir {conf_dir}".format(txt=txt, parse_table=parse_table, parsed_conf=parsed_conf, make_conf=make_conf, conf_dir=conf_dir)],shell=True)
file_png = txt.split('/')[-1].replace('txt', 'png')
subprocess.call(["circos -conf {circos_conf} -outputdir {save_dir} -outputfile {file_png} -param random_string=zgvickusamp| grep created".format(circos_conf=circos_conf, save_dir=save_dir, file_png=file_png)],shell=True)
print "[tableview end] file: %s"%(txt)
def main():
#tableview(txt='/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer/samples/RRI_union_deduplicates.split_full.type_dis.txt')
if len(sys.argv) == 1:
txt = '/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer2/samples/RRI_union_deduplicates.split_full.type_dis.txt'
else:
txt = sys.argv[1]
tableview(txt)
if __name__ == '__main__':
main()
Run the script with previous data:
[zhangqf5@loginview02 tableviewer2]$ pwd
/Share/home/zhangqf5/gongjing/software/circos-tools-0.22/tools/tableviewer2
[zhangqf5@loginview02 tableviewer2]$ ll
total 40K
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 batch
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Aug 2 2017 bin
drwxr-x--- 2 zhangqf5 zhangqf 4.0K May 19 02:21 data
drwxr-x--- 2 zhangqf5 zhangqf 4.0K May 19 02:21 etc
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 img
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 lib
-rwxr----- 1 zhangqf5 zhangqf 2.4K Jul 6 2017 makeimage.py
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 results
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 samples
drwxr-x--- 2 zhangqf5 zhangqf 4.0K Jul 6 2017 uploads
[zhangqf5@loginview02 tableviewer2]$ python makeimage.py /Share/home/zhangqf5/gongjing/DNA-RNA-Protein-interaction-correlation-12-18/data/RRI_union_deduplicates.split_full.type_dis.human.txt
We get plot like this:
However, there is a issue. The connection between RNAs actually has no direction, thus the ribbon color from RNA1 to RNA2 must be the same as from RNA2 to RNA1. In the graph above, for example, there are two bands connect mRNA (red) and others (grey), but these two colors are different. In this case, we need to parse the data table to make all count values apear in only one side of the diagonal (upper or lower).
Here is the function to transform the data format:
from nested_dict import nested_dict
def read_txt(txt='/Share/home/zhangqf5/gongjing/DNA-RNA-Protein-interaction-correlation-12-18/data/type_dis/RRI_union_deduplicates.split_full.type_dis.human.txt'):
dis_dict = nested_dict(2, int)
dis_revise_dict = nested_dict(2, int)
with open(txt, 'r') as TXT:
for n,line in enumerate(TXT):
line = line.strip()
print n,line
if n == 0:
dis_dict['row_order'] = line
elif n == 1:
dis_dict['row_color'] = line
elif n == 2:
dis_dict['row_rnas'] = line
col_rna_ls = line.split()[1:]
else:
row_rna = line.split()[0]
val_ls = map(float, line.split()[1:])
for col_rna, val in zip(col_rna_ls, val_ls):
dis_dict[row_rna][col_rna] = val
print dis_dict
rna_pair_ls = []
for row_rna in col_rna_ls:
for col_rna in col_rna_ls:
dis_revise_dict[row_rna][col_rna] = 0
for row_rna in col_rna_ls:
for col_rna in col_rna_ls:
rna_pair1 = row_rna + '-' + col_rna
rna_pair2 = col_rna + '-' + row_rna
if rna_pair1 in rna_pair_ls:
continue
if rna_pair2 in rna_pair_ls:
continue
rna_pair_ls.append(rna_pair1)
rna_pair_ls.append(rna_pair2)
dis_revise_dict[row_rna][col_rna] += dis_dict[row_rna][col_rna]
if row_rna == col_rna:
continue
dis_revise_dict[row_rna][col_rna] += dis_dict[col_rna][row_rna]
print dis_revise_dict
savefn = txt.replace('txt', 'revise.txt')
with open(savefn, 'w') as SAVEFN:
print >>SAVEFN, dis_dict['row_order']
print >>SAVEFN, dis_dict['row_color']
print >>SAVEFN, dis_dict['row_rnas']
for row_rna in col_rna_ls:
row_rna_ls = [dis_revise_dict[row_rna][col_rna] for col_rna in col_rna_ls]
print >>SAVEFN,row_rna+' '+' '.join(map(str, row_rna_ls))
return savefn
After converting, we get the data with all value on upper triangle:
[zhangqf5@loginview02 data]$ cat RRI_union_deduplicates.split_full.type_dis.human.revise.txt
data 1 2 3 4 5 6 7 8 9
data 202,75,78 83,169,102 205,185,111 98,180,208 129,112,182 238,130,238 255,140,0 74,113,178 169,169,169
data proteinCoding lncRNA miRNA rRNA snoRNA snRNA tRNA NoncanonicalRNA others
proteinCoding 41863.0 14738.0 35981.0 15475.0 220.0 1690.0 88.0 9530.0 5612.0
lncRNA 0 1080.0 2288.0 341.0 61.0 261.0 3.0 1453.0 866.0
miRNA 0 0 36.0 17.0 8.0 24.0 1.0 1534.0 445.0
rRNA 0 0 0 360.0 321.0 56.0 30.0 169.0 157.0
snoRNA 0 0 0 0 299.0 115.0 7.0 128.0 22.0
snRNA 0 0 0 0 0 851.0 2.0 139.0 95.0
tRNA 0 0 0 0 0 0 210.0 4.0 28.0
NoncanonicalRNA 0 0 0 0 0 0 0 1132.0 533.0
others 0 0 0 0 0 0 0 0 400.0
That is:
Then using these data we can visualize the links without direction as below: