Examine all of the data contained in the quotes dataset to better understand the quotes and determine how we can analyze their trends and bias.
Focus our work mainly on the quotes in the New York Times to analyze their trends and make some comparisons with other vendors.
Expand the Timeline
Explore the New York Times from 2015 to 2020 and draw a conclusion about its bias based on the previous work.
Data exploration of the quotes data
Before we assess the New York Times data, we examine all of the data contained in the quotes dataset. This gives us a better understanding of the quotes and of how we can analyze their trends and bias. We focus mainly on trends in topics, speakers, and vendors, and also explore NLP methods for topic detection and sentiment analysis.
We begin by examining the data to determine the relationship between topics and other characteristics such as vendors and speakers. For vendors, we pair each topic with the vendors that primarily publish it. We do the same for speakers, so that we can observe which topics certain individuals prefer. In this way, we get a sense of the preferred topics of vendors and speakers. After examining the topics and sentiments covered by the New York Times, we can use this data to determine whether the New York Times has a bias.
In this section, we visualize how the different New York Times topics are connected to different vendors and speakers. The dataset includes the URLs of the quotes. An NYT URL looks like this: https://www.nytimes.com/2019/04/17/realestate/house-hunting-in-hong-kong.html?partner=rss&emc=rss . NYT URLs follow a set pattern from which we can extract the topic, real estate in this case. For the details of how we do this, please refer to the uploaded code.
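As an illustration of the URL pattern described above, here is a minimal sketch of how a topic could be read from an NYT-style URL. It is not the uploaded project code: the function name `extract_nyt_topic` is hypothetical, and it assumes the topic is the path segment that directly follows the `/YYYY/MM/DD/` date.

```python
import re
from urllib.parse import urlparse


def extract_nyt_topic(url):
    """Return the path segment after the /YYYY/MM/DD/ date, or None.

    Assumption: NYT article paths look like
    /<year>/<month>/<day>/<section>/.../<slug>.html
    """
    # urlparse separates the path from the query string (?partner=rss...).
    parts = [p for p in urlparse(url).path.split("/") if p]
    has_date = (
        len(parts) >= 5
        and re.fullmatch(r"\d{4}", parts[0]) is not None
        and re.fullmatch(r"\d{2}", parts[1]) is not None
        and re.fullmatch(r"\d{2}", parts[2]) is not None
    )
    # The segment right after the date is treated as the topic.
    return parts[3] if has_date else None


example = (
    "https://www.nytimes.com/2019/04/17/realestate/"
    "house-hunting-in-hong-kong.html?partner=rss&emc=rss"
)
print(extract_nyt_topic(example))
```

URLs that do not match the dated pattern (for example the home page) return `None`, so they can be filtered out before any topic counts are computed.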
The following graph shows which news outlets publish quotes similar to those under the different NYT topics. The green circles represent the topics and the blue circles represent the websites.
The following is a similar graph connecting speakers with topics.
The picture below shows an example of the topic trends over time in 2017. The topics are extracted with BERT, a natural language processing model. From it we can find the top 10 topics that people were concerned with in 2017. Among them, the student-teacher-school topic attracted the most attention, followed by films and movies. Music was the third largest topic in 2017.
We observe trends related to the major events of each year. While some topics such as movies, sports, and education are omnipresent, we noticed that during election years and their aftermath, politics became more popular (e.g. Trump, White House). Similarly, 2020 saw many more articles related to health and COVID. We also see that the numbers of articles with positive and negative sentiment are not very different, which suggests that the overall sentiment is roughly neutral.
The following is an interactive map for exploring the topics and their popularity.
We visualize the words related to sentiment and divide them into four groups: positive, negative, neutral, and compound sentiment words. The following picture shows the positive and negative words we extracted from 2017.
Afterwards, we use the Flair sentiment analyzer to compare the word counts for these two sentiments. We can see that positive sentiment is more frequent than negative sentiment.
The following picture shows the most frequent topics and an example of the distribution of speakers' professions within these topics.
We can see that the "politician" occupation is overrepresented both globally and in the most frequent topics. However, we also observe that the media are not biased toward the most famous politicians, e.g. Trump. We conclude that most quotes are related to politics and the government.
We apply the same method used in the general analysis to the New York Times data in order to compare the two. The chart below illustrates the topic trend and sentiment of the New York Times quotations.
Here is the topic trend for quotes published by the New York Times in 2019. Some top topics are the same as in the general analysis, but there are some differences. Compared with other vendors, the New York Times focuses more on politics and religion, with more reporting on gender, China, and the church. Livelihood issues such as traffic and agriculture receive less attention in New York Times reporting.
For the extracted sentiment words, the positive phrases for the New York Times are similar to those of the general vendors. For negative words, other vendors write more about violence and crime, while the New York Times is more concerned with racism and war. As we concluded before, the New York Times tends to be more political. An example of the sentiment trends is shown in the following pictures.
Preference of the New York Times and the bias
The following is a list of topic preferences for the New York Times compared with the general dataset. We can see that the New York Times is more or less on par with the general dataset across the different topics.
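One way to make such a comparison concrete is to compare the share of quotes per topic in each dataset. The sketch below is only an assumption about how a preference score could be defined (NYT share divided by general share); the function names and the toy topic lists are hypothetical and are not taken from the project.

```python
from collections import Counter


def topic_shares(topics):
    """Map each topic to its fraction of all quotes in the list."""
    counts = Counter(topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


def preference_ratio(nyt_topics, general_topics):
    """Ratio of NYT share to general share for every topic in the general data.

    A ratio near 1.0 means NYT covers the topic about as much as the
    general dataset; above 1.0 means relatively more, below 1.0 less.
    """
    nyt = topic_shares(nyt_topics)
    general = topic_shares(general_topics)
    return {topic: nyt.get(topic, 0.0) / share for topic, share in general.items()}


# Toy data, one topic label per quote (made up for illustration).
nyt_demo = ["politics", "politics", "sports", "arts"]
general_demo = ["politics", "sports", "sports", "arts"]
ratios = preference_ratio(nyt_demo, general_demo)
print(ratios)
```

Using shares instead of raw counts keeps the comparison fair when the two datasets differ in size.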
This is a list of speaker preferences for the New York Times compared with the general dataset. We see a pattern similar to the topic preferences: the most popular NYT speakers are the same as those in the general dataset.
In our sentiment analysis, we randomly sample 200,000 quotes each from the New York Times and from the general dataset. We see that NYT is on par with the general distribution of positive and negative sentiment. Further, both distributions appear unbiased in terms of sentiment.
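The sampling step can be sketched as follows. This is a minimal illustration, assuming each quote already carries a sentiment label (for example `POSITIVE` or `NEGATIVE`) from an earlier classification pass; the function name, the fixed seed, and the toy labels are assumptions, not the project's code.

```python
import random
from collections import Counter


def sampled_sentiment_share(labels, sample_size=200_000, seed=0):
    """Draw a random sample of labels and return the share of each label.

    If fewer than `sample_size` labels are available, all of them are used.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(labels))
    sample = rng.sample(labels, k)
    counts = Counter(sample)
    return {label: count / k for label, count in counts.items()}


# Toy labels standing in for precomputed sentiment labels.
demo_labels = ["POSITIVE"] * 60 + ["NEGATIVE"] * 40
shares = sampled_sentiment_share(demo_labels)
print(shares)
```

Running the same function on the NYT labels and on the general labels gives two distributions that can be compared side by side.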
To analyse NYT further, we extract the topics of NYT articles from their URLs (to keep the topics in the format NYT uses). We notice that even at the topic level, NYT is fairly unbiased, or slightly positively biased, across all topics.
Now that we have established that NYT, like the general media outlets, is quite unbiased in the sentiment of its reporting, we would like to observe whether NYT over-represents the interests of a few individuals. To do this, we perform a small experiment. We define the popularity of a quote as the number of media outlets that have used it. Hence, if there are 30 URLs with the same quote, we define the popularity of the quote as 30.
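The popularity definition above can be sketched in a few lines. The record layout (pairs of quote text and URL) and the function name are assumptions for illustration, and this sketch counts distinct URLs so that a repeated row does not inflate the score.

```python
from collections import defaultdict


def quote_popularity(records):
    """Map each quote to the number of distinct URLs it appears under.

    `records` is an iterable of (quote_text, url) pairs.
    """
    urls_by_quote = defaultdict(set)
    for quote, url in records:
        urls_by_quote[quote].add(url)
    return {quote: len(urls) for quote, urls in urls_by_quote.items()}


# Toy records with placeholder URLs.
demo_records = [
    ("quote A", "https://example.com/1"),
    ("quote A", "https://example.org/2"),
    ("quote A", "https://example.org/2"),  # duplicate row, counted once
    ("quote B", "https://example.com/3"),
]
popularity = quote_popularity(demo_records)
print(popularity)
```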
We would like to observe whether quotes from the most influential speakers become much more popular than average. To do this, for each topic we plot the average popularity of quotes from the top 10 speakers against the average popularity of all quotes in that topic. We notice that for topics like Sports and Arts, the average popularity for influential speakers is about the same as the overall average. But for topics like world affairs, business, and opinion, the popularity of the most influential speakers is much higher. Hence, while media outlets are unbiased in the sentiment of their reports, they are heavily biased towards a few individual speakers.
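The comparison above can be sketched as below. The text does not say how the top 10 speakers are chosen, so this sketch assumes they are the speakers with the most quotes in a topic; the function name, the row layout, and the toy numbers are also hypothetical.

```python
from collections import Counter, defaultdict
from statistics import mean


def top_speaker_vs_overall(rows, top_k=10):
    """For each topic, return (avg popularity of top speakers, avg of all quotes).

    `rows` is an iterable of (topic, speaker, popularity), one row per quote.
    Assumption: "top" speakers are those with the most quotes in the topic.
    """
    by_topic = defaultdict(list)
    for topic, speaker, popularity in rows:
        by_topic[topic].append((speaker, popularity))

    result = {}
    for topic, items in by_topic.items():
        quote_counts = Counter(speaker for speaker, _ in items)
        top = {speaker for speaker, _ in quote_counts.most_common(top_k)}
        avg_all = mean(pop for _, pop in items)
        avg_top = mean(pop for speaker, pop in items if speaker in top)
        result[topic] = (avg_top, avg_all)
    return result


# Toy rows; top_k=1 keeps the example small.
demo_rows = [
    ("world", "A", 30),
    ("world", "A", 20),
    ("world", "B", 2),
    ("world", "C", 4),
    ("sports", "D", 5),
    ("sports", "D", 5),
    ("sports", "E", 5),
]
averages = top_speaker_vs_overall(demo_rows, top_k=1)
print(averages)
```

A topic where the first value is far above the second is one where quotes from the top speakers spread across many more outlets than a typical quote.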
General Exploration on topics and sentiments
Focus on the analysis for the New York Times
Extend the timeline for data analysis