{"id":7847,"date":"2019-09-12T15:41:24","date_gmt":"2019-09-12T07:41:24","guid":{"rendered":"http:\/\/www.finereport.com\/en\/?p=7847"},"modified":"2019-10-24T16:59:30","modified_gmt":"2019-10-24T08:59:30","slug":"four-basic-ways-to-automate-data-extraction","status":"publish","type":"post","link":"https:\/\/frg.fineres.com\/en\/2019\/09\/12\/four-basic-ways-to-automate-data-extraction\/","title":{"rendered":"Four Basic Ways to Automate Data Extraction"},"content":{"rendered":"\r\n<h3><strong>Where to Get the Data?<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\nMany people have tried data analysis, but obtaining the data is the crucial first step. Data extraction is the foundation of data analysis; without data, analysis is meaningless. How many data sources we have, how much data we have, and how good that data is largely determine the quality of our output.\r\n\r\n\r\n\r\n\r\n\r\nA trend in the data is usually influenced by multiple dimensions, so we need to collect as many data dimensions as possible from multiple sources, while ensuring data quality, in order to obtain high-quality analysis results.\r\n\r\n\r\n\r\n\r\n\r\nThere are many tools for data analysis, such as <a href=\"http:\/\/www.finereport.com\/en\/\">FineReport<\/a>, <a href=\"https:\/\/www.tableau.com\/\">Tableau<\/a>, <a href=\"https:\/\/powerbi.microsoft.com\/en-us\/\">Power BI<\/a>, etc.\r\n\r\n\r\n\r\n\r\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" class=\"wp-image-7879\" src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019091201L.jpg\" alt=\"\" width=\"750\" height=\"500\" \/><\/figure>\r\n\r\n\r\n\r\n\r\nSo, from the data collection point of view, what are the data sources? 
I have divided data sources into the following four categories.\r\n\r\n\r\n\r\n\r\n\r\n<strong>1. Open data sources (government, university, and enterprise)<\/strong>\r\n\r\n\r\n\r\n\r\n\r\n<strong>2. Crawler scraping (web and application)<\/strong>\r\n\r\n\r\n\r\n\r\n\r\n<strong>3. Log collection (front-end tracking, back-end scripts)<\/strong>\r\n\r\n\r\n\r\n\r\n\r\n<strong>4. Sensors (image, speed, thermal)<\/strong>\r\n\r\n\r\n\r\n\r\n\r\nThese four types of data sources are open data sources, crawler scraping, log collection, and sensors, and each has its own characteristics.\r\n\r\n\r\n\r\n\r\n\r\nOpen data sources are generally industry-specific databases. For example, the US Census Bureau has opened up data on population, regional distribution, and education in the United States. Besides governments, enterprises and universities also open up corresponding datasets. It is worth knowing that many studies are based on open data sources, because comparing the quality of algorithms requires the same dataset.\r\n\r\n\r\n\r\n\r\n\r\nThe second type of data source is crawler scraping, which collects data from web pages and applications; we will look at it in detail below.\r\n\r\n\r\n\r\n\r\n\r\nThe third type of data source is log collection, which records users&#8217; operations for statistical purposes. We can track events at the front end, collect them with scripts at the back end, and thus analyze access to the website.\r\n\r\n\r\n\r\n\r\n\r\nFinally, sensors basically collect physical information, such as images, videos, or an object&#8217;s speed, heat, and pressure. 
Since sensors fall outside the main emphasis of this article, this method will not be described in detail.\r\n\r\n\r\n\r\n\r\n\r\nNow that we know there are four types of data sources, how do we collect from them?\r\n\r\n\r\n\r\n\r\n<h3><strong>How to Use Open Data Sources?<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\nThe following table shows some authoritative open data sources.\r\n\r\n\r\n\r\n\r\n<figure class=\"wp-block-image is-resized\"><img loading=\"lazy\" class=\"wp-image-7874\" src=\"http:\/\/www.finereport.com\/en\/wp-content\/uploads\/2019\/09\/20190912L.png\" alt=\"\" width=\"800\" height=\"580\" \/><\/figure>\r\n\r\n\r\n\r\n\r\nIf you are looking for a data source in a certain field, such as the financial sector, you can check whether governments, universities, and enterprises have opened up relevant datasets.\r\n\r\n\r\n\r\n\r\n<h3><strong>How to Use Crawlers to Scrape Data<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\nCrawler scraping is probably the most common approach, for example, when you want review data for restaurants. Of course, we must pay attention to copyright problems here, and many websites also have anti-crawling mechanisms.\r\n\r\n\r\n\r\n\r\n\r\nThe most straightforward way is to write the crawler in Python; of course, you need to learn Python&#8217;s basic syntax first. PHP can also be used to write crawlers, but it is not as good as Python, especially for multi-threaded operations.\r\n\r\n\r\n\r\n\r\n\r\nA Python crawler basically involves three steps.\r\n\r\n\r\n\r\n\r\n\r\n<strong>1. Crawl the content using Requests.<\/strong> We can use the Requests library, Python&#8217;s well-known HTTP library and an excellent tool for crawlers, to fetch web page content. It makes scraping data from web pages very convenient and saves us a lot of time.\r\n\r\n\r\n\r\n\r\n\r\n<strong>2. 
Parse the content using XPath.<\/strong> XPath stands for XML Path Language. It is a language for locating parts of an XML document and is often used as a small query language in development. XPath selects nodes by elements and attributes.\r\n\r\n\r\n\r\n\r\n\r\n<strong>3. Save your data with Pandas.<\/strong> Pandas is a library that provides advanced data structures and makes data analysis easier; we can use it to save the crawled data. Finally, Pandas writes the data to a file such as XLS or CSV, or to a database such as MySQL.\r\n\r\n\r\n\r\n\r\n\r\nRequests, XPath, and Pandas are three useful tools for Python crawlers. Of course, there are many other tools, such as Selenium, PhantomJS, or Puppeteer.\r\n\r\n\r\n\r\n\r\n\r\nIn addition, we can also scrape web page information without programming. Here are three commonly used crawler tools.\r\n\r\n\r\n\r\n\r\n<h4><strong>&#8211;<\/strong><a href=\"https:\/\/www.import.io\/\"><strong>import.io<\/strong><\/a><\/h4>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img class=\"wp-image-7875\" src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019091203L.png\" alt=\"\" \/><\/figure>\r\n\r\n\r\n\r\n\r\nIts most compelling feature, widely considered the best, is called &#8220;Magic&#8221;: it lets users extract data automatically by entering just a web page, without any other settings.\r\n\r\n\r\n\r\n\r\n<h4><strong>&#8211;<\/strong><a href=\"https:\/\/www.parsehub.com\/\"><strong>ParseHub<\/strong><\/a><\/h4>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img class=\"wp-image-7876\" src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019091204L.png\" alt=\"\" \/><\/figure>\r\n\r\n\r\n\r\n\r\nParseHub is a web-based crawling tool that supports JavaScript rendering, Ajax crawling, cookies, sessions, etc. It can analyze and retrieve data from a website and convert it into meaningful data. 
It can also use machine learning techniques to identify complex documents and export the results to JSON, CSV, Google Sheets, and more.\r\n\r\n\r\n\r\n\r\n<h4><strong>&#8211;<\/strong><a href=\"https:\/\/webscraper.io\/\"><strong>Web Scraper<\/strong><\/a><\/h4>\r\n\r\n\r\n\r\n<figure class=\"wp-block-image\"><img class=\"wp-image-7877\" src=\"http:\/\/www.finereport.com\/en\/wp-content\/themes\/blogs\/images\/2019091205L.png\" alt=\"\" \/><\/figure>\r\n\r\n\r\n\r\n\r\nWeb Scraper is a Chrome extension installed by more than 200,000 people. It supports point-and-click data grabbing and dynamic page rendering, is optimized for JavaScript, Ajax, drop-down and drag operations, and pagination, comes with a full selector system, and supports exporting data to CSV and other formats. The team also offers Cloud Scraper, which supports scheduled tasks, API management, and proxy switching.\r\n\r\n\r\n\r\n\r\n<h3><strong>How to Use Log Collection Tools<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\nThe biggest role of log collection is to improve system performance by analyzing how users access the system, thereby increasing the load the system can handle. Discovering load bottlenecks in time also helps technical personnel optimize the system based on actual user access.\r\n\r\n\r\n\r\n\r\n\r\nThe log records the entire process of a user&#8217;s visit to the website: who visited, at what time, through what channel (such as a search engine or direct URL input), and what operations they performed; whether the system generated errors; and even the user&#8217;s IP, HTTP request time, user agent, and so on.\r\n\r\n\r\n\r\n\r\n\r\nLog collection can be divided into two forms.\r\n\r\n\r\n\r\n\r\n\r\n<strong>1. Collected through the web server. <\/strong>For example, httpd, Nginx, and Tomcat all have their own logging function. 
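As a minimal sketch of working with such server logs, the following Python snippet parses one line in the common "combined" access-log format that httpd and Nginx use by default; the sample line and field names are made up for illustration.

```python
import re

# Pattern for the "combined" access-log format used by httpd and Nginx.
# Each field the log records -- who (IP), when, what request, the referrer
# channel, and the user agent -- gets its own named capture group.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_access_line(line):
    """Parse one access-log line into a dict, or return None if malformed."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# A made-up sample line for illustration.
sample = ('203.0.113.7 - - [12/Sep/2019:15:41:24 +0800] '
          '"GET /pricing HTTP/1.1" 200 5123 '
          '"https://www.google.com/" "Mozilla/5.0"')

record = parse_access_line(sample)
print(record["ip"], record["path"], record["referrer"])
# -> 203.0.113.7 /pricing https://www.google.com/
```

Aggregating such records by path, referrer, or status code is what turns raw server logs into the access analysis described above.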
At the same time, many Internet companies have their own massive data collection tools, mostly used for system log collection, such as Hadoop&#8217;s Chukwa, Cloudera&#8217;s Flume, and Facebook&#8217;s Scribe. These tools all use distributed architectures and can collect and transmit hundreds of megabytes of log data per second.\r\n\r\n\r\n\r\n\r\n\r\n<strong>2. Collected through custom user-behavior tracking.<\/strong> For example, listening to user behavior with JavaScript code, logging through asynchronous AJAX requests to the back end, and more.\r\n\r\n\r\n\r\n\r\n<h3><strong>What is Event Tracking?<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\n<strong>Event tracking is a key step in log collection. <\/strong>\r\n\r\n\r\n\r\n\r\n\r\nEvent tracking means collecting the relevant information at the locations you specify and reporting it: for example, the access status of a page, including user information and device information; or the user&#8217;s operations on the page, including how long they stay. Each tracking point is like a camera: it collects user behavior data, and multi-dimensional analysis of that data can truly reconstruct the user&#8217;s usage scenarios and needs.\r\n\r\n\r\n\r\n\r\n\r\nSo how do we track different events?\r\n\r\n\r\n\r\n\r\n\r\nEvent tracking means embedding statistical code wherever you need statistics. You can write the tracking code yourself, or you can use a third-party statistical tool. Following the principle of &#8220;not reinventing the wheel&#8221;, note that the market for event tracking tools is quite mature; I can recommend third-party tools such as Google Analytics, TalkingData, and more. They all use front-end tracking, and the user&#8217;s behavior data can then be seen in the third-party tools. 
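Conceptually, each tracked event is just a structured record reported from the page to a collection endpoint. A minimal Python sketch, with illustrative field names rather than any particular tool's schema:

```python
import json
import time

def track_event(event_name, user_id, properties=None):
    """Build one tracking record; the field names are illustrative."""
    record = {
        "event": event_name,             # e.g. "page_view", "button_click"
        "user_id": user_id,              # who performed the action
        "timestamp": time.time(),        # when it happened
        "properties": properties or {},  # device info, stay time, etc.
    }
    # In a real pipeline this JSON would be reported to the back end,
    # e.g. through an asynchronous AJAX request to a logging endpoint.
    return json.dumps(record)

payload = track_event("page_view", "u_1001", {"stay_seconds": 42})
```

A back-end script would then append each payload to a log, where the multi-dimensional analysis described above can run over it.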
But if we want to see deeper user behavior, we need to customize the event tracking settings.\r\n\r\n\r\n\r\n\r\n\r\nTo sum up, log collection helps us understand users&#8217; operational data and suits scenarios such as operation and maintenance monitoring, security auditing, and business data analysis. A typical web server has its own logging capability, or you can use Flume to collect, aggregate, and transfer large volumes of log data from different server clusters. We can also use third-party statistical tools or custom event tracking to get the statistics we want.\r\n\r\n\r\n\r\n\r\n<h3><strong>Conclusion<\/strong><\/h3>\r\n\r\n\r\n\r\n\r\nData extraction is the key to <a href=\"http:\/\/www.finereport.com\/en\/data-analysis\/data-analysis-practice-guide-how-to-begin.html\">data analysis<\/a>. Sometimes we solve the problem with a Python web crawler, but in fact data collection methods are very broad. Some data can be taken directly from open data sources: historical Bitcoin price and transaction data, for example, can be downloaded directly from <a href=\"https:\/\/www.kaggle.com\/\">Kaggle<\/a> without crawling it yourself.\r\n\r\n\r\n\r\n\r\n\r\nOn the other hand, the data to be collected varies with our needs: in the transportation industry, for example, data collection involves cameras or speedometers. For operations personnel, log collection and analysis are the key. 
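The three-step crawler workflow described earlier (Requests, XPath, Pandas) can be sketched as follows. To keep the sketch self-contained and runnable, a fixed HTML snippet stands in for the page that step 1 would fetch with requests.get(url).text; the class names and XPath expressions are placeholders, not a real site's markup:

```python
import pandas as pd
from lxml import html  # lxml provides full XPath support

# Step 1 (Requests) would be: page_html = requests.get(url).text
# A fixed snippet stands in here so the sketch runs without a network call.
page_html = """
<html><body>
  <div class="review"><span class="name">Cafe A</span><span class="score">4.5</span></div>
  <div class="review"><span class="name">Cafe B</span><span class="score">3.8</span></div>
</body></html>
"""

# Step 2: parse the content with XPath, selecting by elements and attributes.
tree = html.fromstring(page_html)
names = tree.xpath('//div[@class="review"]/span[@class="name"]/text()')
scores = tree.xpath('//div[@class="review"]/span[@class="score"]/text()')

# Step 3: save the crawled data with Pandas (CSV here; XLS or MySQL also work).
df = pd.DataFrame({"name": names, "score": scores})
df.to_csv("reviews.csv", index=False)
```

The same three steps apply whatever the target page is; only the URL and the XPath expressions change.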
So we need to choose the right collection tool for the specific business scenario.\r\n\r\n\r\n\r\n\r\n\r\nIf you want to know more about data analysis, just follow\u00a0<a href=\"https:\/\/www.facebook.com\/finereport\/\"><em>FineReport Reporting Software<\/em><\/a><em>.<\/em>\r\n\r\n\r\n\r\n\r\n<h2 id=\"id-190801-\u5b98\u7f51-Top10MapTypesinDataVisualization-Youmightalsobeinterestedin\u2026\"><strong>You\u00a0might also be interested in\u2026<\/strong><\/h2>\r\n\r\n\r\n\r\n\r\n<a href=\"http:\/\/www.finereport.com\/en\/data-visualization\/data-visualization-31-tools-that-you-need-know.html\">Data Visualization: 31 Tools that You Need Know<\/a>\r\n\r\n\r\n\r\n\r\n\r\n<a href=\"http:\/\/www.finereport.com\/en\/data-visualization\/top-16-types-of-chart-in-data-visualization.html\">Top 16 Types of Chart in Data Visualization<\/a>\r\n\r\n\r\n\r\n\r\n\r\n<a href=\"http:\/\/www.finereport.com\/en\/data-visualization\/how-beginners-make-a-cool-dashboard%ef%bc%9f.html\">How beginners make a cool dashboard\uff1f<\/a>\r\n\r\n","protected":false},"excerpt":{"rendered":"<p>A data analysis tutorial on how to extract high-quality 
data.<\/p>\n","protected":false},"author":1,"featured_media":8186,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[161],"tags":[151],"_links":{"self":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/7847"}],"collection":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/comments?post=7847"}],"version-history":[{"count":20,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/7847\/revisions"}],"predecessor-version":[{"id":8568,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/posts\/7847\/revisions\/8568"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/media\/8186"}],"wp:attachment":[{"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/media?parent=7847"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/categories?post=7847"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/frg.fineres.com\/en\/wp-json\/wp\/v2\/tags?post=7847"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}