A Detailed Guide to String Functions in R

2016-07-19 16:44:58 | Author: 数据人

I. String-handling functions in the stringr package

1. Case conversion

str_to_upper(string, locale = "")

str_to_lower(string, locale = "")

str_to_title(string, locale = "")

```r
library(stringr)  # load the package
dog <- "The quick brown dog"
str_to_upper(dog)  # convert to upper case
str_to_lower(dog)  # convert to lower case
str_to_title(dog)  # capitalize the first letter of each word

# locale selects language-specific case rules
str_to_upper("i", "en")  # English
str_to_upper("i", "tr")  # Turkish
```

2. invert_match: convert a matrix of match positions into the positions of the non-matching stretches

```r
numbers <- "1 and 2 and 4 and 456"
num_loc <- str_locate_all(numbers, "[0-9]+")[[1]]  # start and end of each run of digits
num_loc
str_sub(numbers, num_loc[, "start"], num_loc[, "end"])
text_loc <- invert_match(num_loc)  # start and end of the stretches that are not digits
text_loc
str_sub(numbers, text_loc[, "start"], text_loc[, "end"])
```

3. modifiers: control how a pattern is interpreted

fixed(pattern, ignore_case = FALSE):Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.

coll(pattern, ignore_case = FALSE, locale = NULL, ...):Compare strings respecting standard collation rules.

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...): The default. Treat the pattern as a regular expression.

boundary(type = c("character", "line_break", "sentence", "word"), skip_word_none = TRUE, ...): Match boundaries between things.

pattern: Pattern to modify behaviour.

ignore_case: Should case differences be ignored in the match?

locale: Locale to use for comparisons. See stringi::stri_locale_list() for all possible options.

...: Other less frequently used arguments passed on to stri_opts_collator, stri_opts_regex, or stri_opts_brkiter.

multiline: If TRUE, $ and ^ match the beginning and end of each line. If FALSE, the default, only match the start and end of the input.

comments: If TRUE, whitespace and comments beginning with # are ignored. Escape literal spaces with a backslash (`\ `).

dotall: If TRUE, . will also match line terminators.

type: Boundary type to detect.

skip_word_none: Ignore "words" that don't contain any characters or numbers, i.e. punctuation.

```r
pattern <- "a.b"
strings <- c("abb", "a.b")
str_detect(strings, pattern)
str_detect(strings, fixed(pattern))
str_detect(strings, coll(pattern))

# coll() is useful for locale-aware case-insensitive matching
i <- c("I", "\u0130", "i")
i
str_detect(i, regex("i", TRUE))
str_detect(i, fixed("i", TRUE))
str_detect(i, coll("i", TRUE))
str_detect(i, coll("i", TRUE, locale = "tr"))

# Word boundaries
words <- c("These are some words.")
str_count(words, boundary("word"))       # number of words in the sentence
str_split(words, " ")[[1]]               # split on spaces; the last word keeps its full stop
str_split(words, boundary("word"))[[1]]  # split on word boundaries; the punctuation is dropped

# Regular expressions
str_extract_all("The Cat in the Hat", "[a-z]+")               # case-sensitive
str_extract_all("The Cat in the Hat", regex("[a-z]+", TRUE))  # case-insensitive

str_extract_all("a\nb\nc", "^.")
str_extract_all("a\nb\nc", regex("^.", multiline = TRUE))

str_extract_all("a\nb\nc", "a.")
str_extract_all("a\nb\nc", regex("a.", dotall = TRUE))
```

4. str_c: join strings

str_c(..., sep = "", collapse = NULL)

str_join(..., sep = "", collapse = NULL) (an older alias, deprecated in favour of str_c)

```r
str_c("Letter: ", letters[1:5])
str_c("Letter", letters[1:5], sep = ": ")  # sep sets the separator between arguments
str_c(letters[1:5], " is for", "...")
str_c(letters[-26], " comes before ", letters[-1])
str_c(letters)
str_c(letters, collapse = "")    # collapse joins all elements into a single string
str_c(letters, collapse = ", ")  # and its value is placed between the elements

# Missing inputs give missing outputs
str_c(c("a", NA, "b"), "-d")

# Use str_replace_na to display literal NAs:
str_c(str_replace_na(c("a", NA, "b")), "-d")
```

5. str_conv: specify the encoding of a string

str_conv(string, encoding)

```r
x <- rawToChar(as.raw(177))
x
str_conv(x, "ISO-8859-2")  # Polish "a with ogonek"
str_conv(x, "ISO-8859-1")  # Plus-minus
```

6. str_count: count the matches of a pattern in a string

str_count(string, pattern = "")

```r
fruit <- c("apple", "banana", "pear", "pineapple")
str_count(fruit, "a")  # number of times "a" occurs in each element of fruit
str_count(fruit, "p")
str_count(fruit, "e")
str_count(fruit, c("a", "b", "p", "p"))
str_count(c("a.", "...", ".a.a"), ".")         # in a regular expression "." matches any single character
str_count(c("a.", "...", ".a.a"), fixed("."))  # fixed(".") matches only a literal "."
```

7. str_detect: test whether a pattern occurs in a string

str_detect(string, pattern)

```r
fruit <- c("apple", "banana", "pear", "pineapple")
str_detect(fruit, "a")        # does the element contain "a"?
str_detect(fruit, "pp")
str_detect(fruit, "^a")       # does the element start with "a"?
str_detect(fruit, "a$")       # does the element end with "a"?
str_detect(fruit, "b")
str_detect(fruit, "[aeiou]")  # does the element contain any of a, e, i, o, u?

# Also vectorised over pattern
str_detect("aecfg", letters)
```

8. str_dup: repeat and concatenate the elements of a character vector

str_dup(string, times)

```r
fruit <- c("apple", "pear", "banana")
str_dup(fruit, 2)  # repeat each element twice and join the copies
str_dup(fruit, 1:3)
str_c("ba", str_dup("na", 0:5))
```

9. str_extract: extract the text that matches a pattern

str_extract(string, pattern): extracts the first match in each string

str_extract_all(string, pattern, simplify = FALSE): extracts every match

```r
shopping_list <- c("apples 4x4", "bag of flour", "bag of sugar", "milk x2")

# First match only
str_extract(shopping_list, "\\d")      # a digit
str_extract(shopping_list, "[a-z]+")   # a run of lowercase letters
str_extract(shopping_list, "[a-z]{1,4}")
str_extract(shopping_list, "\\b[a-z]{1,4}\\b")

# All matches, returned as a list
str_extract_all(shopping_list, "[a-z]+")
str_extract_all(shopping_list, "\\b[a-z]+\\b")
str_extract_all(shopping_list, "\\d")

# All matches, returned as a character matrix (simplify = TRUE)
str_extract_all(shopping_list, "\\b[a-z]+\\b", simplify = TRUE)
str_extract_all(shopping_list, "\\d", simplify = TRUE)
```

10. str_length: the length of a string

```r
str_length(letters)
str_length(NA)
str_length(factor("abc"))
str_length(c("i", "like", "programming", NA))

# Two ways of representing a u with an umlaut
u1 <- "\u00fc"
u2 <- stringi::stri_trans_nfd(u1)

# They print the same:
u1
u2

# But have a different length
str_length(u1)
str_length(u2)

# Even though they have the same number of characters
str_count(u1)
str_count(u2)
```

11. str_locate: locate the position of pattern matches in a string

str_locate(string, pattern): returns the position of the first match

str_locate_all(string, pattern): returns the positions of all matches

```r
fruit <- c("apple", "banana", "pear", "pineapple")

# Position of the first match:
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "ap")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))

# Positions of all matches:
str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))

# Position of every character
str_locate_all(fruit, "")
```

12. str_match: extract matched groups from a string

str_match(string, pattern): returns the first match together with its capture groups

str_match_all(string, pattern): returns every match together with its capture groups

```r
strings <- c(" 219 733 8965", "329-293-8753 ", "banana", "595 794 7569",
  "387 287 6718", "apple", "233.398.9187 ", "482 952 3315",
  "239 923 8115 and 842 566 4692", "Work: 579-499-7527", "$1000",
  "Home: 543.355.3679")

phone <- "([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

str_extract(strings, phone)  # the first complete match in each string
str_match(strings, phone)    # the first complete match plus one column per capture group

# Extract/match all
str_extract_all(strings, phone)
str_match_all(strings, phone)
```

13. str_order: order or sort a character vector

str_order(x, decreasing = FALSE, na_last = TRUE, locale = "", ...): returns the ordering indices

str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "", ...): returns the sorted vector

```r
str_order(letters, locale = "en")
str_sort(letters, locale = "en")
str_order(letters, locale = "haw")
str_sort(letters, locale = "haw")
```

14. str_pad: pad a string on the left, right, or both sides with a character (a space by default)

str_pad(string, width, side = c("left", "right", "both"), pad = " ")

width: the minimum width of the padded string

side: the side on which padding is added; the default is left

pad: the single character used for padding

```r
rbind(
  str_pad("hadley", 30, "left"),
  str_pad("hadley", 30, "right"),
  str_pad("hadley", 30, "both")
)

# All arguments are vectorised except side
str_pad(c("a", "abc", "abcdef"), 10)
str_pad("a", c(5, 10, 20))
str_pad("a", 10, pad = c("-", "_", " "))

# Longer strings are returned unchanged
str_pad("hadley", 3, pad = '-')
str_pad("hadley", width = 8, pad = '-')  # shorter than width, so it is padded
```

15. str_replace: replace pattern matches in a string

str_replace(string, pattern, replacement): replaces the first match

str_replace_all(string, pattern, replacement): replaces every match

```r
fruits <- c("one apple", "two pears", "three bananas")
str_replace(fruits, "[aeiou]", "-")      # replace the first matching character
str_replace_all(fruits, "[aeiou]", "-")  # replace every matching character

str_replace(fruits, "([aeiou])", "")
str_replace(fruits, "([aeiou])", "\\1\\1")
str_replace(fruits, "[aeiou]", c("1", "2", "3"))
str_replace(fruits, c("a", "e", "i"), "-")

str_replace_all(fruits, "([aeiou])", "")
str_replace_all(fruits, "([aeiou])", "\\1\\1")
str_replace_all(fruits, "[aeiou]", c("1", "2", "3"))
str_replace_all(fruits, c("a", "e", "i"), "-")

# If you want to apply multiple patterns and replacements to the same
# string, pass a named vector to pattern.
str_replace_all(str_c(fruits, collapse = "---"),
  c("one" = "1", "two" = "2", "three" = "3"))
```

16. str_replace_na: turn missing values into the string "NA"

str_replace_na(string, replacement = "NA")

```r
str_replace_na(c(NA, "abc", "def"))
```

17. str_split: split a string on a separator

str_split(string, pattern, n = Inf): returns a list

str_split_fixed(string, pattern, n): returns a character matrix

```r
fruits <- c(
  "apples and oranges and pears and bananas",
  "pineapples and mangos and guavas"
)
str_split(fruits, " and ")

# n sets the maximum number of pieces
str_split(fruits, " and ", n = 3)  # at most 3 pieces
str_split(fruits, " and ", n = 2)  # at most 2 pieces
str_split(fruits, " and ", n = 5)

# Use str_split_fixed to return a character matrix
str_split_fixed(fruits, " and ", 3)
str_split_fixed(fruits, " and ", 4)
str_split_fixed(fruits, " and ", 6)
```

18. str_sub: extract or replace substrings of a character vector by position

str_sub(string, start = 1L, end = -1L): extracts a substring

str_sub(string, start = 1L, end = -1L) <- value: replaces a substring

```r
hw <- "Hadley Wickham"
str_sub(hw, 1, 6)
str_sub(hw, end = 6)
str_sub(hw, 8, 14)
str_sub(hw, 8)
str_sub(hw, c(1, 8), c(6, 14))

# Negative indices count from the end of the string
str_sub(hw, -1)
str_sub(hw, -7)
str_sub(hw, end = -7)

# Alternatively, you can pass in a two-column matrix, as in the
# output from str_locate_all
pos <- str_locate_all(hw, "[aeio]")[[1]]
str_sub(hw, pos)
str_sub(hw, pos[, 1], pos[, 2])

# Vectorisation
str_sub(hw, seq_len(str_length(hw)))
str_sub(hw, end = seq_len(str_length(hw)))

# Replacement form
x <- "BBCDEF"
str_sub(x, 1, 1) <- "A"; x
str_sub(x, -1, -1) <- "K"; x
str_sub(x, -2, -2) <- "GHIJ"; x
str_sub(x, 2, -2) <- ""; x
```

19. str_subset: keep the elements of a character vector that match a pattern

str_subset(string, pattern)

```r
fruit <- c("apple", "banana", "pear", "pineapple")
str_subset(fruit, "a")
str_subset(fruit, "ap")
str_subset(fruit, "^a")
str_subset(fruit, "a$")
str_subset(fruit, "b")
str_subset(fruit, "[aeiou]")

# Missings are silently dropped
str_subset(c("a", NA, "b"), ".")
```

20. str_trim: remove whitespace from the start and end of a string

str_trim(string, side = c("both", "left", "right"))

```r
str_trim("  String with trailing and leading white space\t")
str_trim("\n\nString with trailing and leading white space\n\n")
```

21. str_wrap: wrap text into lines of a given width

str_wrap(string, width = 80, indent = 0, exdent = 0)

width: the target width of each line

indent: the indentation of the first line of each paragraph

exdent: the indentation of every line after the first

```r
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 70), "\n")
cat(str_wrap(thanks, width = 60, indent = 6), "\n")
cat(str_wrap(thanks, width = 80, indent = 6, exdent = 2), "\n")
```

22. word: extract words from a sentence

word(string, start = 1L, end = start, sep = fixed(" "))

```r
sentences <- c("Jane saw a cat", "Jane sat down")
word(sentences, 1)      # the first word
word(sentences, 2)      # the second word
word(sentences, -1)     # the last word
word(sentences, 2, -1)  # the second word through the last word

# Also vectorised over start and end
word(sentences[1], 1:3, -1)
word(sentences[1], 1, 1:4)

# Specify a different separator
str <- 'abc.def..123.4568.999'
word(str, 1, sep = fixed('..'))
word(str, 2, sep = fixed('..'))
```

II. String-handling functions in base R

23. paste(): join strings

paste(..., sep = " ", collapse = NULL)

```r
paste("A", 1:6, sep = "")
paste("A", 1:6, sep = "", collapse = '-')  # with collapse, the result is a single string
paste(1:6, collapse = '')
paste(1:6, collapse = '-')
paste("Today is", date())
```

24. strsplit(): split strings

strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE)

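split: the separator; by default it is interpreted as a regular expression

fixed: logical, default FALSE; when TRUE, split is matched as a literal string

perl: logical, default FALSE; when TRUE, Perl-compatible regular expressions are used

useBytes: logical, default FALSE; when TRUE, matching is done byte by byte rather than character by character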

```r
x <- c(as = "asfef", qu = "qwerty", "yuiop[", "b", "stuff.blah.yech")
strsplit(x, "e")
unlist(strsplit("a.b.c", "."))    # "." is a regex for any character, so only empty strings come back
unlist(strsplit("a.b.c", "[.]"))  # split on a literal "."

# Or equivalently:
unlist(strsplit("a.b.c", ".", fixed = TRUE))
x <- 'ascd123afrwf34535ddggh454fgf5e4'
unlist(strsplit(x, split = '[0-9]+', perl = TRUE))  # split on runs of digits
unlist(strsplit(x, split = '[a-z]+', perl = TRUE))  # split on runs of letters
```

25. nchar(): count the characters in a string

nchar(x, type = "chars", allowNA = FALSE)

```r
x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
nchar(x)
```
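The `type` argument in the signature above decides what is counted. The following is a minimal sketch, assuming R 3.3.0 or later; the variable names are made up for illustration. It also shows how `nchar()` treats missing values, which differs from `str_length(NA)` in stringr.

```r
# "ü" as a single precomposed code point, forced to UTF-8
s <- enc2utf8("\u00fc")

n_chars <- nchar(s, type = "chars")  # counts characters
n_bytes <- nchar(s, type = "bytes")  # counts the bytes of the UTF-8 encoding

# A logical NA is coerced to the two-character string "NA" ...
n_logical_na <- nchar(NA)
# ... whereas a missing string stays missing
n_string_na <- nchar(NA_character_)

print(c(chars = n_chars, bytes = n_bytes))
print(n_logical_na)
print(n_string_na)
```

When a vector may contain missing values, it is usually safer to build it with `NA_character_` than with a bare `NA`.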

26. substr: extract or replace substrings

(1) substr(x, start, stop)

(2) substring(text, first, last = 1000000L)

(3) substr(x, start, stop) <- value

(4) substring(text, first, last = 1000000L) <- value

```r
# A single string:
substr("abcdef", 2, 4)
substring("abcdef", 2, 4)
substring("abcdef", 1:6, 1:6)
substr(rep("abcdef", 4), 1:4, 4:5)

# A character vector:
x <- c("asfef", "qwerty", "yuiop[", "b", "stuff.blah.yech")
substr(x, 2, 5)  # take a substring of every element of x
substring(x, 2, 4:6)
substring(x, 2) <- c("..", "+++")  # replace by assignment
x
```

27. Character translation and case conversion

chartr(old, new, x)

tolower(x)

toupper(x)

casefold(x, upper = FALSE)

```r
x <- "MiXeD cAsE 123"
chartr("iXs", "why", x)  # i -> w, X -> h, s -> y: characters are replaced one for one
chartr("a-cX", "D-Fw", x)
tolower(x)  # convert to lower case
toupper(x)  # convert to upper case
casefold(x, upper = FALSE)
casefold(x, upper = TRUE)
```

28. Pattern matching and replacement

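(1) grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE): returns the indices of the elements of x that match

ignore.case: logical, default FALSE (matching is case-sensitive)

perl: logical, default FALSE; when TRUE, Perl-compatible regular expressions are used

value: logical; whether to return the matching elements themselves rather than their indices; the default FALSE returns indices

fixed: logical, default FALSE; when TRUE, pattern is matched as a literal string

useBytes: logical, default FALSE

invert: logical, default FALSE; when TRUE, the elements that do not match are returned

```r
txt <- c("10arm03", "Foot 12", " 678-lefroo.345", "__.bafoobar90..")
grep(pattern = 'foo', x = txt, value = FALSE)  # case-sensitive; indices of the matching elements
grep(pattern = 'foo', x = txt, value = TRUE)   # case-sensitive; the matching elements themselves
grep(pattern = 'foo', x = txt, ignore.case = TRUE)                # case-insensitive; indices
grep(pattern = 'foo', x = txt, ignore.case = TRUE, value = TRUE)  # case-insensitive; values
grep(pattern = 'foo', x = txt, ignore.case = TRUE, value = TRUE, invert = TRUE)  # values that do not match
grep(pattern = '^[0-9]+', x = txt, perl = TRUE)                # indices of elements that start with a digit
grep(pattern = '[0-9]+$', x = txt, perl = TRUE, value = TRUE)  # elements that end with a digit
grep(pattern = '\\d$', x = txt, perl = TRUE, value = TRUE)     # the same, written with \d
```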
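The effect of `fixed` is easiest to see with a pattern that contains a regular-expression metacharacter. The sketch below uses a made-up `versions` vector (not from the original article) and compares the default regular-expression matching, literal matching with `fixed = TRUE`, and an escaped regular expression.

```r
# Hypothetical input: only the first element contains a literal "1.2"
versions <- c("v1.2", "v1x2", "v12")

# Default: "." is a metacharacter that matches any single character
as_regex <- grep("1.2", versions, value = TRUE)

# fixed = TRUE: the pattern is taken literally
as_literal <- grep("1.2", versions, value = TRUE, fixed = TRUE)

# Escaping the dot gives the same result as fixed = TRUE
escaped <- grep("1\\.2", versions, value = TRUE)

print(as_regex)
print(as_literal)
print(escaped)
```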

(2) grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE): returns a logical vector of the same length as x, TRUE for the elements that match and FALSE for those that do not.

```r
txt <- c("10arm03", "Foot 12", " 678-lefroo.345", "__.bafoobar90..")
grepl(pattern = 'foo', x = txt)
grepl(pattern = '\\d$', x = txt, perl = TRUE)
```

(3) sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE): replaces the first match in each element

```r
txt <- c("10arm03", "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")
sub(pattern = 'foo', replacement = '99', x = txt)                  # replace the first "foo" in each element with 99
sub(pattern = '\\d+$', replacement = '+++', x = txt, perl = TRUE)  # replace trailing digits with +++
```

(4) gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE): replaces every match in each element

```r
txt <- c("10arm03", "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")
gsub(pattern = 'foo', replacement = '99', x = txt)                # replace every "foo" with 99
gsub(pattern = '\\d+', replacement = '+++', x = txt, perl = TRUE)  # replace every run of digits with +++
```

(5) regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE): returns, for each element, the position of the first match and its length in characters; for elements with no match, both the position and the length are -1.

```r
txt <- c("10arm03", "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")
regexpr(pattern = 'foo', text = txt)
regexpr(pattern = '\\d+', text = txt)
```
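The positions and lengths returned by `regexpr()` are rarely the end goal; usually the matched text is wanted. Here is a small sketch that pairs `regexpr()` with base R's `regmatches()`; the variable names are illustrative only.

```r
txt <- c("10arm03", "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")

m <- regexpr("\\d+", txt)

# Start position and length of the first run of digits in each element
starts    <- as.integer(m)
match_len <- attr(m, "match.length")

# regmatches() turns the positions back into the matched text;
# elements without a match would simply be dropped from the result
first_numbers <- regmatches(txt, m)

print(starts)
print(match_len)
print(first_numbers)
```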

(6) gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE): returns, for each element, the positions of all matches and their lengths in characters

```r
txt <- c("10arm03", "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")
gregexpr(pattern = 'foo', text = txt)
gregexpr(pattern = '\\d+', text = txt)
```
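As with `regexpr()`, the result of `gregexpr()` can be passed to `regmatches()` to recover the matched text, this time as a list with one element per input string. This is a minimal sketch on a shortened, illustrative input.

```r
txt <- c("10arm03", "Foot 12 foot")

# One character vector of matches per input string
all_numbers <- regmatches(txt, gregexpr("\\d+", txt))

# Number of matches found in each string
match_counts <- lengths(all_numbers)

print(all_numbers)
print(match_counts)
```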

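(7) regexec(pattern, text, ignore.case = FALSE, fixed = FALSE, useBytes = FALSE): returns, for each element, the position and length of the first match and of each parenthesised sub-expression in the pattern

```r
txt <- c(NA, "Foot 12 foot", " 678-lefroo.345", "__.bafoobar90foobar..")
regexec(pattern = 'foo', text = txt)
regexec(pattern = '\\d+', text = txt)
```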
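What sets `regexec()` apart from `regexpr()` is that it also reports capture groups. The sketch below assumes a made-up `dates` vector and a simple year-month-day pattern, and uses `regmatches()` to pull out the full match followed by each group.

```r
# Hypothetical input: one string with a date, one without
dates <- c("2016-07-19", "no date here")

m <- regexec("(\\d{4})-(\\d{2})-(\\d{2})", dates)
parts <- regmatches(dates, m)

# First element: the full match, then one entry per capture group
print(parts[[1]])
# Second element: an empty character vector, because nothing matched
print(parts[[2]])
```

This is the base R counterpart of `str_match()` from the stringr section, except that the result is a list rather than a matrix.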

Modification is strictly prohibited. Reposting is allowed; please credit 数据人网 and link to the original article.
