Scrape More with Less Code

Meta Info.

  • Author: [Pili Hu](http://hupili.net/)
  • Repo: [Easy Scraping in Python](https://github.com/hupili/workshop-easy-scraping)
  • Demo: scrapely, python-readability, PyQuery, pandas, httpie, etc.

Prerequisites:

  • Python3
  • pip install -r requirements.txt
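
Before running the notebook, it can help to confirm which of the demo libraries are importable. The following is a minimal sketch, not part of the original workshop. The import names `scrapely`, `readability`, `pyquery`, and `pandas` are assumptions based on the demo list above, and `check_modules` is a hypothetical helper.

```python
import importlib.util
import sys

# Assumed top-level import names for the demo libraries (not confirmed by the repo).
ASSUMED_MODULES = ["scrapely", "readability", "pyquery", "pandas"]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# The prerequisites ask for Python 3, so report the interpreter version too.
print("Python %d.%d" % sys.version_info[:2])
for name, available in check_modules(ASSUMED_MODULES).items():
    print("%-12s %s" % (name, "ok" if available else "missing"))
```

Anything reported as missing can usually be fixed by rerunning the pip command above in the same environment that launches the notebook.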

FAQ about the Environment

Q: What is this webpage you are using?

A: IPython Notebook

In [1]:
# This is the input block -- a full-blown Python shell
print('Look: I will be shown on output block')
Look: I will be shown on output block

Useful tricks in the IPython Notebook

In [2]:
import pprint
# IPython.display is the public import path; IPython.core.display is deprecated for this
from IPython.display import HTML
In [3]:
HTML('Logo of Initium Lab: <img src="%s">' % 'http://initiumlab.com/favicon-32x32.png')
Out[3]:
Logo of Initium Lab:
In [4]:
# Display any HTML easily
my_html = '''
I'm going to show you:
<ul>
    <li> PyReadability </li>
    <li> PyQuery </li>
    <li> ... </li>
</ul>
'''
HTML(my_html)
Out[4]:
I'm going to show you:
  • PyReadability
  • PyQuery
  • ...
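
The same kind of snippet can be generated from a Python list instead of being typed by hand. This is a minimal sketch under stated assumptions: `html_list` is a hypothetical helper, not part of IPython or of the workshop repo, and it escapes each item so that stray `<` or `&` characters do not break the markup.

```python
from html import escape


def html_list(title, items):
    """Build an HTML snippet: a title line followed by an unordered list."""
    rows = "\n".join("    <li> %s </li>" % escape(item) for item in items)
    return "%s\n<ul>\n%s\n</ul>\n" % (escape(title), rows)


snippet = html_list("I'm going to show you:", ["PyReadability", "PyQuery", "..."])
print(snippet)
```

In a notebook cell, passing `snippet` to `HTML(...)` should render it the same way as the hand-written string above.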

A small hack to allow a longer output area without scrolling

In [5]:
%%javascript
//IPython.OutputArea.auto_scroll_threshold = 9999;
IPython.OutputArea.prototype._should_scroll = function(){return false;}

Why Scraping?

In [6]:
# I'm going to insert some slides here
from IPython.display import Image
In [7]:
Image('assets/venn-skillset.png')
Out[7]:
In [8]:
Image('assets/workflow-highlight-data-collection.png')
Out[8]: