Association analysis of the services and tools used within companies
Wantedly, well known as a business SNS, recently surveyed companies about the services and tools they use and published the results as company tools.
The page covers nine categories. For each web service, application, or other tool it shows the number of companies using it along with their review comments, although not every company registered on Wantedly is included.
That is interesting enough in itself, but it left me with a few questions:
1. Which tools are popular?
2. Which tools are used in combination?
3. Is there any relationship with company characteristics (number of employees, line of business, and so on)?
The first can be answered from Wantedly's per-category summary pages, but I figured I might as well plot it in R. For the second, association analysis, the technique used to find relationships such as "people who buy this product also buy that one", looks applicable. For the third, I considered classifying companies based on the text of their Wantedly recruiting pages, but data such as headcount and line of business did not look obtainable, so that one is on hold. So I tried the first and second in R.
- 💡 Visualizing popular tools and services
- Looking at associations among the services in use
- References
- 💻 Execution environment
💡 Visualizing popular tools and services
First, let's draw some figures in R from the information Wantedly publishes. For each of the following nine categories defined in company tools, I summarize the ten most popular tools and services.
- Communication
- Information sharing and storage
- Project management
- Recruiting and training services
- Sales
- Marketing
- Development and technology
- Design
- Customer support
R code for the visualization:
library(rvest)
library(ggplot2)
library(emoGG)
library(gridExtra)
library(viridis)
library(dplyr)

# Adjust the look of ggplot2 (Japanese font on macOS, classic theme)
quartzFonts(YuGo = quartzFont(rep("YuGo-Medium", 4)))
theme_set(theme_classic(base_size = 12, base_family = "YuGo"))

base.url <- "https://www.wantedly.com/company_tools"

# Fetch the data for communication tools and keep the top 10
df.com <- read_html(paste(base.url, "categories", "communication", sep = "/")) %>%
  html_nodes(xpath = sprintf("//*[@id=\"company-tools\"]/div/div[2]/div/div/ul/div/li/div/span/a")) %>% {
    data_frame(service  = html_nodes(., xpath = "div") %>% html_text(),
               count    = html_nodes(., xpath = "h2") %>% html_text() %>% tidyr::extract_numeric(),
               category = "コミュニケーション") %>%  # "Communication"
      .[1:10, ]
  }

# Bar chart ordered by count; the plot labels are kept in Japanese as in the original
ggplot(df.com, aes(reorder(service, count), count)) +
  geom_bar(stat = "identity", aes(fill = count)) +
  scale_fill_viridis() +
  # Crown emoji (1f451) above the top three bars, which sit at x = 8:10 after reordering
  geom_emoji(data = data.frame(x = 8:10, y = df.com$count[1:3] %>% sort()),
             aes(x = x, y = y), position = position_nudge(y = 10), emoji = "1f451") +
  guides(fill = FALSE) +
  xlab("サービス") +                        # "Service"
  ylab("利用している企業数") +              # "Number of companies using it"
  ggtitle("人気のコミュニケーションサービス") +  # "Popular communication services"
  theme(axis.text.x = element_text(angle = 40, hjust = 1))
Running the code above for every category produces the figures below. In each category the top three are marked with a crown emoji (the labels came out a little squashed...).
We can see that Slack and Google Analytics, as well as AWS and GitHub, are popular.
Looking at associations among the services in use
Next, the second question: which tools are used together. As mentioned at the start, with data of this form we can treat the set of services each company uses as transaction data, so association analysis should be applicable. I begin by collecting the data needed for the analysis. The details of association analysis are not covered in depth here, so I recommend the reference pages at the end.
Preparing the list of target companies
The most popular service is Google Drive, reportedly used by 263 companies, but the company tools page displays only some of them. The companies shown also change on every page load, so I did not target all companies. I would feel bad putting load on Wantedly's servers with repeated access, so I try to get the necessary information from as few sessions as possible. To make the companies shown in a single access the targets of this analysis, I build a list of them.
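The idea of keeping requests to a minimum can be sketched roughly as follows. This is not the code used in this post: `polite_fetch` and its `fetch` and `delay` arguments are hypothetical names introduced only for illustration, and a stand-in fetcher is used so the sketch runs without touching any server.

```r
# Hypothetical helper (not from the original analysis): fetch each distinct
# URL at most once and wait `delay` seconds between requests.
# `fetch` is any function that takes a URL and returns the page content.
polite_fetch <- function(urls, fetch, delay = 3) {
  cache <- list()
  for (u in unique(urls)) {
    if (length(cache) > 0) Sys.sleep(delay)  # pause before every request except the first
    cache[[u]] <- fetch(u)
  }
  cache[urls]  # results in the order (and multiplicity) requested
}

# Stand-in fetcher that only counts how often it is called
n_calls <- 0
fake_fetch <- function(u) {
  n_calls <<- n_calls + 1
  paste("content of", u)
}

pages <- polite_fetch(c("/a", "/b", "/a"), fake_fetch, delay = 0)
n_calls  # the duplicated URL is requested only once
```

In a real run, `fetch` could be something like `xml2::read_html`, with `delay` left at a few seconds.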
R code for preparing the list of target companies:
library(pforeach)
library(dplyr)

# Collect the name and URL of every company shown on the top page (duplicates removed)
df.company <- read_html(base.url) %>%
  html_nodes(xpath = "//*[@id=\"company-tools\"]/div/div/div/div/ul/div/li/div/div/div[2]/a") %>% {
    data_frame(url     = html_attr(., name = "href") %>% paste0("https://www.wantedly.com", .),
               company = html_text(.)) %>%
      unique()
  }

# Number of target companies
df.company %>% nrow()
So the analysis uses the tools and services of 83 companies. That is only about 30% of the total, so the sample may be biased.
Building the transaction data and running the analysis
I use the {arules} package.
Building the transaction data:
library(pforeach)
library(arules)
library(magrittr)  # provides %<>% and %$% used below

# Visit each company page (waiting 3 seconds between requests)
# and collect its tools as "category=tool" strings
df.res <- npforeach(i = 1:nrow(df.company), .c = rbind)({
  Sys.sleep(3)
  read_html(df.company$url[i]) %>%
    html_nodes(xpath = "//*[@id=\"company-tools-company\"]/div/div/div/ul/li/div/a/div") %>% {
      dplyr::data_frame(id   = i,
                        item = paste0(html_nodes(., "div") %>% html_text(trim = TRUE),
                                      "=",
                                      html_nodes(., "h3") %>% html_text(trim = TRUE)))
    }
})

# Replace the Japanese tool category names with English labels
# (the patterns must stay in Japanese to match the scraped text)
df.res %<>% dplyr::mutate(
  item = gsub("コミュニケーションツール", "communication", item),
  item = gsub("情報共有・蓄積ツール", "knowledge", item),
  item = gsub("プロジェクト管理ツール", "project_management", item),
  item = gsub("採用・育成サービス", "human_resource", item),
  item = gsub("営業ツール", "sales", item),
  item = gsub("マーケティングツール", "marketing", item),
  item = gsub("開発・テクノロジーツール", "development", item),
  item = gsub("デザインツール", "design", item),
  item = gsub("カスタマーサポートツール", "customer_support", item))

# One transaction per company: split the items by company id
res.trans <- df.res %>%
  as.data.frame() %$%
  split(item, id) %>%
  as(., "transactions")
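What "treating each company's tools as a transaction" means can be shown on a tiny example in base R. The data below are made up for illustration and are not part of the scraped results. The sketch shows the same `split(item, id)` step as above, plus the equivalent binary incidence matrix that a transactions object represents conceptually.

```r
# Toy long-format data: one row per (company id, tool) pair (made-up values)
df <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  item = c("communication=Slack", "development=GitHub", "development=AWS",
           "communication=Slack", "development=GitHub",
           "communication=Chatwork"),
  stringsAsFactors = FALSE
)

# One character vector of items per company: the "transactions"
baskets <- split(df$item, df$id)

# Equivalent binary incidence matrix: rows = companies, columns = items
incidence <- unclass(table(df$id, df$item)) > 0

# Number of companies using each item
item_counts <- colSums(incidence)
item_counts
```

Each row of `incidence` corresponds to one company, so column sums give the same kind of counts that `itemFrequency(..., type = "absolute")` reports for the real data.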
Let's check the generated transaction data.
res.trans
## transactions in sparse format with
##  83 transactions (rows) and
##  129 items (columns)

# Show the contents for company id = 1
LIST(res.trans[1])
## $`1`
##  [1] "communication=Slack" "design=GIMP" "design=Illustrator" "design=Inkscape"
##  [5] "design=Photoshop" "design=Pinterest" "development=AWS" "development=CircleCI"
##  [9] "development=DeployGate" "development=GitHub" "development=Mackerel" "development=New Relic"
## [13] "development=wercker" "human_resource=Green" "human_resource=Linkedin" "human_resource=Wantedly Admin"
## [17] "knowledge=esa.io" "marketing=@press" "marketing=Google Analytics" "marketing=Google Search Console"
## [21] "marketing=Hootsuite" "marketing=Mailchimp" "marketing=Mixpanel" "marketing=Optimizely"
## [25] "marketing=PR TIMES" "marketing=Repro" "project_management=asana" "project_management=GitHub"
## [29] "project_management=pivotal tracker" "project_management=Trello"

# Summary of the whole transaction data: the most frequent items
summary(res.trans) %>% .@itemSummary
## marketing=Google Analytics     knowledge=Google Drive        communication=Slack         design=Illustrator           design=Photoshop                    (Other)
##                         50                         48                         46                         44                         44                        892

# Absolute frequency (number of companies) for the first few items (services and tools)
itemFrequency(res.trans, type = "absolute") %>% head()
##           communication=Chatwork         communication=co-meeting             communication=direct communication=Facebook messenger     communication=Google Hangout
##                               29                                1                                1                               20                               25
##            communication=Hipchat
##                                4

# Show part of the affinity between pairs of items (how often they occur together)
# Pairs that never occur together have an affinity of 0
affinity(res.trans)[1:5, 1:5]
##                                  communication=Chatwork communication=co-meeting communication=direct communication=Facebook messenger communication=Google Hangout
## communication=Chatwork                        0.0000000                     0.00                 0.00                        0.3243243                    0.2857143
## communication=co-meeting                      0.0000000                     0.00                 1.00                        0.0500000                    0.0000000
## communication=direct                          0.0000000                     1.00                 0.00                        0.0500000                    0.0000000
## communication=Facebook messenger              0.3243243                     0.05                 0.05                        0.0000000                    0.3235294
## communication=Google Hangout                  0.2857143                     0.00                 0.00                        0.3235294                    0.0000000
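As a rough check on how to read the affinity values, the sketch below assumes that affinity() is the Jaccard similarity between two items (companies using both, divided by companies using at least one). That is my understanding of the measure rather than something stated above, but the numbers in the output are consistent with it. The co-occurrence count is not printed anywhere, so it is back-calculated here from the reported affinity. The function name `jaccard_from_counts` is made up for this example.

```r
# Jaccard similarity from counts: |A and B| / |A or B|
jaccard_from_counts <- function(n_a, n_b, n_both) {
  n_both / (n_a + n_b - n_both)
}

# Values taken from the output above
n_chatwork  <- 29         # companies using communication=Chatwork
n_messenger <- 20         # companies using communication=Facebook messenger
reported    <- 0.3243243  # affinity printed for this pair

# Solve reported = n_both / (n_a + n_b - n_both) for n_both
n_both <- round(reported * (n_chatwork + n_messenger) / (1 + reported))
n_both

# Recompute the affinity from the three counts
jaccard_from_counts(n_chatwork, n_messenger, n_both)
```

Under this reading, the affinity of 1 between co-meeting and direct means that the single company using one of them is also the single company using the other.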
From here, I use the apriori() function to run association analysis with the Apriori algorithm, a basic and widely used algorithm for association analysis.
# Tune the support and confidence thresholds
(rules <- res.trans %>%
   apriori(parameter = list(support = 0.3, confidence = 0.5, target = "rules"),
           control = list(verbose = FALSE)))
## set of 71 rules

# Sort in descending order of confidence
rules <- sort(rules, decreasing = TRUE, by = "confidence")
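For readers curious what the Apriori algorithm does under the hood, here is a toy base-R illustration of its level-wise search for frequent itemsets. It is only a sketch under my own assumptions: the baskets are made up, the threshold is a count rather than a proportion, and this is not how {arules} implements it. The key idea is that a set of k items can only be frequent if all of its subsets of k - 1 items are frequent, so candidates that fail this test are pruned before counting.

```r
# Made-up baskets for illustration only
baskets <- list(
  c("Slack", "GitHub", "AWS"),
  c("Slack", "GitHub"),
  c("Chatwork", "AWS"),
  c("Slack", "GitHub", "AWS"),
  c("Chatwork")
)
min_count <- 2  # an itemset is "frequent" if at least 2 baskets contain it

# Number of baskets that contain every item of `itemset`
count_of <- function(itemset) {
  sum(vapply(baskets, function(b) all(itemset %in% b), logical(1)))
}
keep_frequent <- function(sets) {
  Filter(function(s) count_of(s) >= min_count, sets)
}

frequent <- list()
current <- keep_frequent(as.list(sort(unique(unlist(baskets)))))  # level 1
k <- 1
while (length(current) > 0) {
  frequent <- c(frequent, current)
  k <- k + 1
  pool <- sort(unique(unlist(current)))  # items still appearing in frequent sets
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)
  # Prune: keep a candidate only if all of its (k - 1)-subsets were frequent
  was_frequent <- function(s) {
    any(vapply(current, function(f) setequal(f, s), logical(1)))
  }
  candidates <- Filter(function(cand) {
    all(vapply(combn(cand, k - 1, simplify = FALSE), was_frequent, logical(1)))
  }, candidates)
  current <- keep_frequent(candidates)  # count only the surviving candidates
}

itemset_labels <- vapply(frequent, paste, character(1), collapse = "+")
itemset_labels
```

Rules such as {Slack} => {GitHub} are then derived from these frequent itemsets by checking the confidence of each possible split.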
To print the results of apriori(), use inspect(). The rules are already sorted by descending confidence, so I display only the first few.
# Print part of the apriori() results
# Columns: condition (lhs), consequence (rhs), support, confidence, lift
inspect(rules[1:10])
##    lhs                                                                        rhs                  support   confidence lift
## 61 {design=Photoshop,marketing=Google Analytics}                           => {design=Illustrator} 0.3975904 0.9705882  1.830882
## 58 {design=Photoshop,knowledge=Google Drive}                               => {design=Illustrator} 0.3734940 0.9687500  1.827415
## 54 {communication=Slack,design=Illustrator}                                => {design=Photoshop}   0.3132530 0.9629630  1.816498
## 71 {design=Photoshop,knowledge=Google Drive,marketing=Google Analytics}    => {design=Illustrator} 0.3132530 0.9629630  1.816498
## 57 {design=Illustrator,knowledge=Google Drive}                             => {design=Photoshop}   0.3734940 0.9393939  1.772039
## 7  {development=GitHub}                                                    => {development=AWS}    0.3493976 0.9354839  1.941129
## 33 {design=Illustrator}                                                    => {design=Photoshop}   0.4939759 0.9318182  1.757748
## 34 {design=Photoshop}                                                      => {design=Illustrator} 0.4939759 0.9318182  1.757748
## 55 {communication=Slack,design=Photoshop}                                  => {design=Illustrator} 0.3132530 0.9285714  1.751623
## 70 {design=Illustrator,knowledge=Google Drive,marketing=Google Analytics}  => {design=Photoshop}   0.3132530 0.9285714  1.751623
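To make the three measures concrete, the values for the rule {design=Photoshop} => {design=Illustrator} can be recomputed by hand from numbers already shown: 83 transactions, 44 companies each for Photoshop and Illustrator (from the item summary), and the reported support of 0.4939759. The variable names below are my own, and the number of companies using both tools is inferred from the support rather than read from the data.

```r
n_companies   <- 83         # number of transactions
n_photoshop   <- 44         # companies using design=Photoshop (lhs)
n_illustrator <- 44         # companies using design=Illustrator (rhs)
support_rule  <- 0.4939759  # reported support of the rule

# Companies using both tools, inferred from the reported support
n_both <- round(support_rule * n_companies)

support    <- n_both / n_companies                        # share of companies with lhs and rhs
confidence <- n_both / n_photoshop                        # share of lhs companies that also have rhs
lift       <- confidence / (n_illustrator / n_companies)  # confidence relative to the rhs base rate

round(c(support = support, confidence = confidence, lift = lift), 6)
```

These reproduce the 0.9318182 confidence and 1.757748 lift in the table. A lift above 1 means Illustrator is more common among Photoshop users than among the 83 companies overall.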
Other kinds of output are also possible.
# List some of the rules that match a condition
subset(rules, subset = rhs %in% "design=Illustrator") %>% inspect() %>% head()
##    lhs                                                                     rhs                  support   confidence lift
## 61 {design=Photoshop,marketing=Google Analytics}                        => {design=Illustrator} 0.3975904 0.9705882  1.830882
## 58 {design=Photoshop,knowledge=Google Drive}                            => {design=Illustrator} 0.3734940 0.9687500  1.827415
## 71 {design=Photoshop,knowledge=Google Drive,marketing=Google Analytics} => {design=Illustrator} 0.3132530 0.9629630  1.816498
## ...

# Frequent itemsets found with eclat(), sorted by support
eclat(res.trans, parameter = list(support = 0.6)) %>%
  sort(decreasing = TRUE, by = "support") %>%
  inspect()
## ...
##   items                        support
## 1 {marketing=Google Analytics} 0.6024096

# Which services, when in use, go together with GitHub?
rules.lhs.gh <- res.trans %>%
  apriori(appearance = list(default = "lhs", rhs = "development=GitHub"),
          control = list(verbose = FALSE)) %>%
  sort(decreasing = TRUE, by = "support")

inspect(rules.lhs.gh[1:5])
##    lhs                                                                rhs                    support   confidence lift
## 27 {communication=Slack,development=AWS}                           => {development=GitHub} 0.2771084 0.8214286  2.199309
## 26 {development=AWS,project_management=GitHub}                     => {development=GitHub} 0.2409639 0.8695652  2.328191
## 73 {communication=Slack,development=AWS,project_management=GitHub} => {development=GitHub} 0.2289157 0.9500000  2.543548
## 3  {development=New Relic}                                         => {development=GitHub} 0.2048193 0.8947368  2.395586
## 19 {development=AWS,development=New Relic}                         => {development=GitHub} 0.2048193 0.8947368  2.395586
To get an overview of the results, let's plot them. The {arulesViz} package plots the rules class objects produced by arules::apriori().
R code for the visualization:
library(arulesViz)

# Grouped matrix plot of all rules
plot(rules, method = "grouped")

# Graph of the rules, sorted by lift, with items as vertices
sort(rules, by = "lift") %>%
  plot(method = "graph", control = list(type = "items"))
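There seems to be plenty to discuss here, but I would like to study a bit more and then try again, this time including the third question (the relationship with company characteristics).
References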
- Nina Zumel and John Mount (2014). Practical Data Science With R. Manning Publications
- Association Rules - RDataMining.com: R and Data Mining
- Market Basket Analysis with R - Listen Data
- An overview on Association Rules - ekonlab.com
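- R {arules} によるアソシエーション分析をちょっと詳しく <1> [A slightly closer look at association analysis with R {arules} <1>] - StatsFragments
- 前処理なしのトランザクションデータを{arules}パッケージで読み込む方法 [How to read unpreprocessed transaction data with the {arules} package] - 東京で働くデータサイエンティストのブログ
- アソシエーション分析+グラフ構造可視化 ({arules} + {arulesViz}) で教師あり学習の変数重要度を可視化する [Visualizing variable importance in supervised learning with association analysis and graph-structure visualization ({arules} + {arulesViz})] - 東京で働くデータサイエンティストのブログ
- 「使ってみたくなる統計」シリーズ 第2回:アソシエーション分析 ["Statistics you will want to try" series, part 2: association analysis] | ビッグデータマガジン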
💻 Execution environment
devtools::session_info() %>% {
  print(.$platform)
  # Show only the attached packages (marked with "*")
  .$packages %>%
    dplyr::filter(`*` == "*") %>%
    knitr::kable(format = "markdown")
}
## setting value
## version R version 3.2.3 (2015-12-10)
## system x86_64, darwin13.4.0
## ui X11
## language En
## collate en_US.UTF-8
## tz Asia/Tokyo
## date 2016-02-21
package | * | version | date | source |
---|---|---|---|---|
arules | * | 1.3-1 | 2015-12-14 | CRAN (R 3.2.3) |
arulesViz | * | 1.1-0 | 2015-12-13 | CRAN (R 3.2.3) |
dplyr | * | 0.4.3.9000 | 2015-10-28 | Github () |
emoGG | * | 0.0.1 | 2015-11-28 | Github () |
ggplot2 | * | 2.0.0 | 2015-12-18 | CRAN (R 3.2.3) |
gridExtra | * | 2.0.0 | 2015-07-14 | CRAN (R 3.1.3) |
magrittr | * | 1.5 | 2016-01-13 | Github () |
Matrix | * | 1.2-3 | 2015-11-28 | CRAN (R 3.2.3) |
remoji | * | 0.1.0 | 2016-01-19 | Github () |
rvest | * | 0.3.1 | 2015-11-11 | CRAN (R 3.2.2) |
viridis | * | 0.3.2 | 2016-01-03 | Github () |
xml2 | * | 0.1.2 | 2015-09-01 | CRAN (R 3.2.0) |