Python and RSS

Published on May 19, 2013.

feedparser is a Python library/module that implements an RSS parser¹.

This post shows how to use it.

Reading the feed from a URL

To load the feed available at http://www.feedforall.com/sample.xml we use:

>>> import feedparser
>>> d = feedparser.parse('http://www.feedforall.com/sample.xml')
>>> type(d)
<class 'feedparser.FeedParserDict'>
>>> d.keys()
dict_keys(['feed', 'status', 'updated', 'updated_parsed', 'encoding', 'bozo', 'headers', 'etag', 'href', 'version', 'entries', 'namespaces'])

As you can see, the return value is a dictionary (a FeedParserDict, whose keys can also be read as attributes, e.g. d.feed).
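Two of the keys listed above, status and bozo, are useful for checking whether the load went well. The following is a minimal sketch, not part of feedparser: the helper name load_problems and the stand-in dictionary are assumptions made for illustration. It relies only on dictionary-style access, so it works on any object shaped like the result of feedparser.parse().

```python
def load_problems(result):
    """Return a list of human-readable problems found in a parsed feed.

    `result` is expected to behave like the dictionary returned by
    feedparser.parse(); a plain dict with the same keys also works.
    """
    problems = []

    # 'status' is only present when the feed was fetched over HTTP.
    status = result.get("status")
    if status is not None and status >= 400:
        problems.append(f"HTTP status {status}")

    # 'bozo' is truthy when the parser hit a problem such as malformed XML;
    # in that case 'bozo_exception' holds the exception that was raised.
    if result.get("bozo"):
        problems.append(f"parse problem: {result.get('bozo_exception')!r}")

    if not result.get("entries"):
        problems.append("no entries")

    return problems


# Stand-in for a parsed result (hypothetical values, not real output).
fake_result = {"status": 404, "bozo": 0, "entries": []}
print(load_problems(fake_result))  # ['HTTP status 404', 'no entries']
```

With a real feed, the same helper would be called as load_problems(d); an empty list means none of these checks failed.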

Inspecting the feed

Some common feed elements are:

status,
title,
author,
link,
description,
update date.

To retrieve this information we use:

>>> d.status
200
>>> d.feed.title
'FeedForAll Sample Feed'
>>> d.feed.author
'marketing@feedforall.com'
>>> d.feed.link
'http://www.feedforall.com/industry-solutions.htm'
>>> d.feed.description
'RSS is a fascinating technology. The uses for RSS are expanding daily. Take a closer look at how various industries are using the benefits of RSS in their businesses.'
>>> d.updated
'Tue, 19 Oct 2004 12:38:55 GMT'
>>> d.feed.updated
'Tue, 19 Oct 2004 13:38:55 -0400'
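The update dates above are plain strings. One way to turn them into datetime objects is the standard library's email.utils.parsedate_to_datetime, which understands this date format. This is a sketch under assumptions: the helper name parse_feed_date is made up for illustration, and treating a date with no usable offset as UTC is a choice of this example, not something the feed guarantees.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_feed_date(value):
    """Parse an RFC 822 style date string into an aware datetime in UTC."""
    dt = parsedate_to_datetime(value)
    if dt.tzinfo is None:
        # No usable offset in the string; assume UTC (an assumption of this sketch).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


feed_updated = parse_feed_date('Tue, 19 Oct 2004 13:38:55 -0400')
print(feed_updated.isoformat())  # 2004-10-19T17:38:55+00:00
```

The result also exposes an updated_parsed key (see the output of d.keys()), which holds an already parsed form of the date.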

Inspecting feed items

The feed items are stored as a list of dictionaries under the entries key. The number of items is given by:

>>> len(d.entries)

Each entry has the following information:

title,
summary,
link,
tags,
last update.

This information is accessed using:

>>> d.entries[0].title
'RSS Solutions for Restaurants'
>>> d.entries[0].summary
'<b>FeedForAll </b>helps Restaurant\'s communicate with customers. Let your customers know the latest specials or events.<br />\n<br />\nRSS feed uses include:<br />\n<i><font color="#FF0000">Daily Specials <br />\nEntertainment <br />\nCalendar of Events </i></font>'
>>> d.entries[0].link
'http://www.feedforall.com/restaurant.htm'
>>> d.entries[0].tags
[{'term': 'Computers/Software/Internet/Site Management/Content Management', 'scheme': 'www.dmoz.com', 'label': None}]
>>> d.entries[0].updated
'Tue, 19 Oct 2004 11:09:11 -0400'

Some of the values retrieved above have been cleaned/sanitized by the parser.
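Since the entries form a list of dictionaries, the per-entry fields can be collected in a loop. The sketch below assumes only dictionary-style access; the helper name entry_rows and the second sample entry are hypothetical, and the sample list stands in for d.entries so the example runs without a network connection.

```python
def entry_rows(entries):
    """Return (title, link, updated) tuples, using '' for missing fields."""
    rows = []
    for entry in entries:
        rows.append((
            entry.get("title", ""),
            entry.get("link", ""),
            entry.get("updated", ""),
        ))
    return rows


# Stand-in data shaped like d.entries; the second entry is hypothetical
# and shows what happens when fields are missing.
sample_entries = [
    {
        "title": "RSS Solutions for Restaurants",
        "link": "http://www.feedforall.com/restaurant.htm",
        "updated": "Tue, 19 Oct 2004 11:09:11 -0400",
    },
    {"title": "Untitled item"},
]

for title, link, updated in entry_rows(sample_entries):
    print(title, link, updated, sep=" | ")
```

Using get() with a default avoids an exception when a feed omits an optional element for some item.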

Entries as plain text

To convert HTML entries to plain text we can use the lxml² library/module:

>>> from lxml import html
>>> d.entries[0].summary
'<b>FeedForAll </b>helps Restaurant\'s communicate with customers. Let your customers know the latest specials or events.<br />\n<br />\nRSS feed uses include:<br />\n<i><font color="#FF0000">Daily Specials <br />\nEntertainment <br />\nCalendar of Events </i></font>'
>>> doc = html.fromstring(d.entries[0].summary)
>>> doc.text_content()
"FeedForAll helps Restaurant's communicate with customers. Let your customers know the latest specials or events.\nRSS feed uses include:Daily Specials \nEntertainment \nCalendar of Events "

Another option is html2text <https://github.com/aaronsw/html2text>.
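When neither library is installed, a rough fallback can be built on the standard library's html.parser. This is only a sketch: the names TextExtractor and html_to_text are made up for illustration, the sample string is a shortened version of the summary shown earlier, and the result is not guaranteed to match what lxml's text_content() returns for every input.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment and ignore the tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return the concatenated text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


summary = "<b>FeedForAll </b>helps Restaurant's communicate with customers.<br />\n<i>Daily Specials</i>"
print(html_to_text(summary))
```

Tags are dropped without inserting any whitespace, so line breaks in the output come only from newlines already present in the markup.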

References
