[Getting and Cleaning Data] Week 2
For more detail, download the HTML file here.
Week 2
Reading data from MySQL
What is MySQL? SQL is short for Structured Query Language, and MySQL is the world's most popular open-source database. (Further information can be found on the wiki page and the MySQL page.)
And why use it? The important point to keep in mind is that, as a data scientist, your role will likely involve collecting data from a database, and perhaps later putting some data back into it. Usually, though, the basic data collection has already been done before you arrive, so you will typically be handed a database and have to get the data out of it.
Now we will focus on how to access a MySQL database using R. First you need to install the R package RMySQL. Instructions can be found in my blog post How to install RMySQL package on Windows. On a Mac, the usual install.packages("RMySQL") is OK. Then we will access the database and collect some information about it.
Step 1. Connect to the server with the dbConnect function.
library(RMySQL)
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
Step 2. Run a query with the dbGetQuery function. (Here the result contains all the databases on this server.)
result <- dbGetQuery(ucscDb, "show databases;")
Step 3. Disconnect with the dbDisconnect function. (It is very important that whenever you have finished analysing or collecting data from a MySQL server, you disconnect from it.)
dbDisconnect(ucscDb)
Or you can use the dbSendQuery function in step 2. The differences between dbGetQuery and dbSendQuery are:
1. dbSendQuery only submits and synchronously executes the SQL statement on the database engine. It does not extract any records; for that you need the dbFetch function, and then you must call dbClearResult when you finish fetching the records you need.
2. dbGetQuery comes with a default implementation that calls dbSendQuery and then, if dbHasCompleted is TRUE, uses dbFetch to return the results. on.exit is used to ensure the result set is always freed by dbClearResult. Subclasses should override this method only if they provide some sort of performance optimisation.
The code above lists all databases in the connection. If we want to focus on a specific database, we need to pass its name via the dbname argument.
Step 1. Connect with the dbConnect function, now with the dbname argument.
hg19 <- dbConnect(MySQL(), user="genome", dbname="hg19", host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables) # number of tables (data frames) in database hg19
allTables[1:4]
dbListFields(hg19, "affyU133Plus2")
num <- dbGetQuery(hg19, "select count(*) from affyU133Plus2")
num
oldw <- getOption("warn")
options(warn = -1)
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
options(warn = oldw)
Or you can use the dbSendQuery function.
oldw <- getOption("warn")
options(warn = -1)
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
options(warn = oldw)
affyMisSmall <- dbFetch(query, n = 10)  # fetch only the first 10 records
affyMis <- dbFetch(query, n = -1)       # fetch all remaining records
dbClearResult(query)
dbDisconnect(hg19)
Reading data from HDF5
What is HDF? HDF stands for Hierarchical Data Format, and HDF5 is a data model, library, and file format for storing and managing data. (More details can be found here.)
Now we begin to play with HDF5.
First, install and load the rhdf5 package, which is installed through Bioconductor. (This will install packages from Bioconductor, which is used for genomics but also has good "big data" packages.)
#source("http://bioconductor.org/biocLite.R")
#biocLite("rhdf5")
library(rhdf5)
Then create an HDF5 file and some groups, and list the contents.
created <- h5createFile("./data/example.h5")
created <- h5createGroup("./data/example.h5", "foo")
created <- h5createGroup("./data/example.h5", "baa")
created <- h5createGroup("./data/example.h5", "foo/foobaa")
h5ls("./data/example.h5")
# matrix
A <- matrix(1:10, 5, 2)
# write the matrix to a particular group
h5write(A, "./data/example.h5", "foo/A")
# multidimensional array
B <- array(seq(0.1, 2, by = 0.2), dim = c(5, 2, 2))
# add attributes
attr(B, "scale") <- "liter"
# write the array to a particular group
h5write(B, "./data/example.h5", "foo/foobaa/B")
h5ls("./data/example.h5")
# data frame
df <- data.frame(1L:5L, seq(0, 1, length.out = 5), c("ab", "cde", "fghi", "a", "s"), stringsAsFactors = FALSE)
# write the data frame to a particular group
h5write(df, "./data/example.h5", "df")
h5ls("./data/example.h5")
# read HDF5 data
readA <- h5read("./data/example.h5", "foo/A")
readB <- h5read("./data/example.h5", "foo/foobaa/B")
readdf <- h5read("./data/example.h5", "df")
readA
# write only to a chunk (rows 1-3 of column 1) of an existing dataset
h5write(c(12, 13, 15), "./data/example.h5", "foo/A", index = list(1:3, 1))
h5read("./data/example.h5", "foo/A")
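The index argument works for reads as well. A minimal sketch, assuming the example.h5 file written above exists:

```r
library(rhdf5)
# read only rows 1-3 of column 1 of foo/A, leaving the rest on disk;
# useful when the dataset is too large to load in full
partA <- h5read("./data/example.h5", "foo/A", index = list(1:3, 1))
partA
```

Reading and writing in chunks like this is the main reason to reach for HDF5 with data that does not fit in memory.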
Reading data from website
Webscraping: programmatically extracting data from the HTML code of a website.
(1) Use the readLines function together with the url function.
con <- url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
htmlCode <- readLines(con)
close(con)
#htmlCode
(2) Use the XML package.
library(XML)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes = TRUE)
xpathSApply(html, "//title", xmlValue)
xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
(Maybe the last script no longer works, but it gives us a good view of how to extract information from a website.)
(3) Use the httr package.
library(httr)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html2 <- GET(url)
content2 <- content(html2, as = "text")
parseHtml <- htmlParse(content2, asText = TRUE)
xpathSApply(parseHtml, "//title", xmlValue)
The httr package also makes it easy to access websites protected by a password:
pg1 <- GET("http://httpbin.org/basic-auth/user/passwd")
pg1
pg2 <- GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", "passwd"))
pg2
names(pg2)
Reading data from API
What is an API? API stands for application programming interface, through which pieces of software can interact with each other. For example, most internet companies, like Twitter or Facebook, have an application programming interface from which you can download data: data about which users are tweeting and what they are tweeting about, or about what people are posting on Facebook.
library(httr)
myapp <- oauth_app("twitter",
    key = "YourConsumerKey",
    secret = "YourConsumerSecret")
sig <- sign_oauth1.0(myapp,
    token = "YourAccessToken",
    token_secret = "YourAccessTokenSecret")
homeTL <- GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
# use the content function to extract the information
json1 <- content(homeTL)
# reformat the nested list as a data frame via jsonlite
json2 <- jsonlite::fromJSON(jsonlite::toJSON(json1))
json2[1, 1:4]
You may freely share or copy this text, but please keep the URL of this article as the reference URL.
Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.