Pandas and DataFrames

In this lesson we will be exploring data analysis using Pandas.

  • College Board talks about ideas like
    • Tools. "the ability to process data depends on users' capabilities and their tools"
    • Combining Data. "combine county data sets"
    • Status on Data. "determining the artist with the greatest attendance during a particular month"
    • Data poses challenge. "the need to clean data", "incomplete data"
  • From Pandas Overview -- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.

DataFrame

'''Pandas is used to gather data sets through its DataFrames implementation'''
import pandas as pd

Cleaning Data

When looking at a data set, check to see what data needs to be cleaned. Examples include:

  • Missing Data Points
  • Invalid Data
  • Inaccurate Data

Run the following code to see what needs to be cleaned

df = pd.read_json('files/grade.json')

print(df)
# What part of the data set needs to be cleaned?
      # "Junior", "9th Grade" need to be cleaned
# From PBL learning, what is a good time to clean data?  Hint, remember Garbage in, Garbage out?
      # When data is entered by users
      # ex. if a user enters something with numbers into a field called "Name"
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
2         578             12  2.78
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
6         145             12  2.95
7         167             10  3.90
8         235      9th Grade  3.15
9         nil              9  2.80
10        469             11  3.45
11        456             10  2.75
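The checks listed above (missing, invalid, and inaccurate data) can be sketched in Pandas. This is a minimal sketch, not the lesson's own solution: the rows are copied from the printed output rather than read from 'files/grade.json', the `year_labels` mapping is a hypothetical choice, and the 9 to 12 range for "Year in School" is an assumption about what counts as valid.

```python
import pandas as pd

# Rows copied from the printed output above (not read from files/grade.json)
raw = pd.DataFrame({
    "Student ID": [123, 246, 578, 469, 324, 313, 145, 167, 235, "nil", 469, 456],
    "Year in School": [12, 10, 12, 11, "Junior", 20, 12, 10, "9th Grade", 9, 11, 10],
    "GPA": [3.57, 4.00, 2.78, 3.45, 4.75, 3.33, 2.95, 3.90, 3.15, 2.80, 3.45, 2.75],
})

clean = raw.copy()

# Hypothetical mapping from text labels to grade numbers
year_labels = {"Junior": 11, "9th Grade": 9}
clean["Year in School"] = clean["Year in School"].map(lambda v: year_labels.get(v, v))

# Anything that still is not a number (such as 'nil') becomes NaN
clean["Year in School"] = pd.to_numeric(clean["Year in School"], errors="coerce")
clean["Student ID"] = pd.to_numeric(clean["Student ID"], errors="coerce")

# Assumption: valid years are 9 through 12, so the row with year 20 is rejected.
# GPA is left alone: whether 4.75 is invalid depends on whether GPAs are weighted.
valid = clean["Student ID"].notna() & clean["Year in School"].between(9, 12)

# Keep valid rows and drop exact duplicate rows (Student ID 469 appears twice)
clean = clean[valid].drop_duplicates()
print(clean)
```

Rejecting rows is only one option; a form that validates input when the user types it would stop the bad values from entering the data set in the first place.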

Extracting Info

Take a look at some features of the Pandas library that extract info from the dataset.

DataFrame Extract Column

print(df[['GPA']])

print()

#try two columns and remove the index from print statement
print(df[['Student ID','GPA']].to_string(index=False))
     GPA
0   3.57
1   4.00
2   2.78
3   3.45
4   4.75
5   3.33
6   2.95
7   3.90
8   3.15
9   2.80
10  3.45
11  2.75

Student ID  GPA
       123 3.57
       246 4.00
       578 2.78
       469 3.45
       324 4.75
       313 3.33
       145 2.95
       167 3.90
       235 3.15
       nil 2.80
       469 3.45
       456 2.75

DataFrame Sort

print(df.sort_values(by=['GPA']))

print()

#sort the values in reverse order
print(df.sort_values(by=['GPA'], ascending=False))

DataFrame Selection or Filter

print(df[df.GPA > 3.00])
   Student ID Year in School   GPA
0         123             12  3.57
1         246             10  4.00
3         469             11  3.45
4         324         Junior  4.75
5         313             20  3.33
7         167             10  3.90
8         235      9th Grade  3.15
10        469             11  3.45

DataFrame Selection Max and Min

print(df[df.GPA == df.GPA.max()])
print()
print(df[df.GPA == df.GPA.min()])
  Student ID Year in School   GPA
4        324         Junior  4.75

   Student ID Year in School   GPA
11        456             10  2.75
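Beyond max and min, Pandas also provides mean, median, and mode. The sketch below copies the GPA column from the output shown earlier into a Series, so it runs without the JSON file; the variable name `gpa` is only for illustration.

```python
import pandas as pd

# GPA values copied from the printed grade table
gpa = pd.Series(
    [3.57, 4.00, 2.78, 3.45, 4.75, 3.33, 2.95, 3.90, 3.15, 2.80, 3.45, 2.75],
    name="GPA",
)

print("mean:  ", round(gpa.mean(), 2))
print("median:", gpa.median())
print("mode:  ", gpa.mode().tolist())   # mode() returns a Series because ties are possible

# idxmax/idxmin return the row label of the extreme value instead of the value itself
print("row with highest GPA:", gpa.idxmax())
print("row with lowest GPA: ", gpa.idxmin())
```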

Create your own DataFrame

Using Pandas allows you to create your own DataFrame in Python.

Python Dictionary to Pandas DataFrame

import pandas as pd

#the data can be stored as a python dictionary
#(named data rather than dict, so the built-in dict type is not shadowed)
data = {
  "calories": [420, 380, 390, 387],
  "duration": [50, 40, 45, 35]
}
#stores the data in a data frame
print("-------------Dict_to_DF------------------")
df = pd.DataFrame(data)
print(df)

print("----------Dict_to_DF_labels--------------")

#or with the index argument, you can label rows.
df = pd.DataFrame(data, index = ["day1", "day2", "day3", "day4"])
print(df)
-------------Dict_to_DF------------------
   calories  duration
0       420        50
1       380        40
2       390        45
3       387        35
----------Dict_to_DF_labels--------------
      calories  duration
day1       420        50
day2       380        40
day3       390        45
day4       387        35

Examine DataFrame Rows

print("-------Examine Selected Rows---------")
#use a list for multiple labels:
print(df.loc[["day1", "day3"]])

#refer to the row index:
print("--------Examine Single Row-----------")
print(df.loc["day1"])
-------Examine Selected Rows---------
      calories  duration
day1       420        50
day3       390        45
--------Examine Single Row-----------
calories    420
duration     50
Name: day1, dtype: int64

Pandas DataFrame Information

print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, day1 to day4
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   calories  4 non-null      int64
 1   duration  4 non-null      int64
dtypes: int64(2)
memory usage: 268.0+ bytes
None

Example of larger data set

Pandas can read CSV and many other types of files. Run the following code to see more features with a larger data set.

import pandas as pd

#read csv and sort 'Duration' largest to smallest
df = pd.read_csv('files/data.csv').sort_values(by=['Duration'], ascending=False)

print("--Duration Top 10---------")
print(df.head(10))

print("--Duration Bottom 10------")
print(df.tail(10))
--Duration Top 10---------
     Duration  Pulse  Maxpulse  Calories
69        300    108       143    1500.2
79        270    100       131    1729.0
109       210    137       184    1860.4
60        210    108       160    1376.0
106       180     90       120     800.3
90        180    101       127     600.1
65        180     90       130     800.4
61        160    110       137    1034.4
62        160    109       135     853.0
67        150    107       130     816.0
--Duration Bottom 10------
     Duration  Pulse  Maxpulse  Calories
68         20    106       136     110.4
100        20     95       112      77.7
89         20     83       107      50.3
135        20    136       156     189.0
94         20    150       171     127.4
95         20    151       168     229.4
139        20    141       162     222.4
64         20    110       130     131.4
112        15    124       139     124.2
93         15     80       100      50.5

APIs are a Source for Writing Programs with Data

3rd Party APIs are a great source for creating Pandas Data Frames.

  • Data can be fetched and resulting json can be placed into a Data Frame
  • Observe output, this looks very similar to a Database
'''Pandas can be used to analyze data'''
import pandas as pd
import requests

def fetch():
    '''Obtain data from an endpoint and print a slice of it'''
    url = "https://flask.nighthawkcodingsociety.com/api/covid/"
    response = requests.get(url)  # named response so the function name fetch is not reused
    data = response.json()        # named data so the json module name is not shadowed

    # filter data for requirement
    df = pd.DataFrame(data['countries_stat'])  # filter endpoint for country stats
    print(df.loc[0:5, 'country_name':'deaths']) # show rows 0 through 5 and columns country_name through deaths
    
fetch()
  country_name       cases     deaths
0          USA  82,649,779  1,018,316
1        India  43,057,545    522,193
2       Brazil  30,345,654    662,663
3       France  28,244,977    145,020
4      Germany  24,109,433    134,624
5           UK  21,933,206    173,352
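In the output above, the case and death counts contain commas, which suggests they arrive as text rather than numbers. Text sorts alphabetically, so a numeric sort or a max would not behave as expected. A minimal sketch of the conversion, assuming the values are strings formatted as shown and using three rows copied from the output instead of calling the API:

```python
import pandas as pd

# Three rows copied from the printed output; the counts are strings with commas
sample = pd.DataFrame({
    "country_name": ["USA", "India", "Brazil"],
    "cases": ["82,649,779", "43,057,545", "30,345,654"],
    "deaths": ["1,018,316", "522,193", "662,663"],
})

# Remove the thousands separators, then convert to integers
for col in ["cases", "deaths"]:
    sample[col] = sample[col].str.replace(",", "", regex=False).astype("int64")

# Numeric columns can now be sorted and compared correctly
by_deaths = sample.sort_values(by="deaths", ascending=False)
print(by_deaths)
```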

Hacks

AP Prep

  • Add this Blog to your own Blogging site. In the Blog, add notes and observations on each code cell.
  • In the Blog, add College Board practice problems for 2.3.

Over the next 4 weeks, Teachers want you to improve your understanding of data. Look at this blog and others on Unit 2. Your goal is to find some things that differentiate your individual College Board project.

  • Create or find your own dataset. The suggestion is to use a JSON file; integrating it with your PBL project would be fabulous.
  • When choosing a data set, think about the following:
    • Does it have a good sample size?
    • Is there bias in the data?
    • Does the data set need to be cleaned?
    • What is the purpose of the data set?
    • ...
  • Continue this Blog using Pandas to extract info from that dataset (ex. max, min, mean, median, mode, etc.)

Hack Helpers

Here is how Mort started on this assignment by asking ChatGPT ... Regarding Python Pandas, what are some data sets that would be good for learning Pandas?

  • There are many data sets that are suitable for learning pandas, depending on your interests and the skills you want to develop. Here are some suggestions:

    • Titanic Dataset: This is a classic dataset for data analysis and machine learning, and is often used as an introduction to pandas. The dataset contains information about passengers on the Titanic, including their demographics, ticket class, and survival status.

    • Iris Dataset: This is another classic dataset that is often used in machine learning and data analysis courses. The dataset contains information about iris flowers, including their sepal length, sepal width, petal length, and petal width.

    • NBA Player Stats: This dataset contains information about NBA players, including their stats for various games, their positions, and their salaries. It is a great dataset for learning how to clean and manipulate data using pandas.

    • Housing Prices Dataset: This dataset contains information about housing prices in a particular city or region, including variables such as the number of bedrooms, square footage, and price. It is a great dataset for learning how to manipulate and analyze numerical data using pandas.

    • COVID-19 Dataset: This dataset contains information about the COVID-19 pandemic, including the number of cases, deaths, and recoveries in various countries and regions. It is a great dataset for learning how to manipulate and analyze time-series data using pandas.

    • World Bank Dataset: This dataset contains information about economic indicators from various countries, such as GDP, population, and poverty rates. It is a great dataset for learning how to manipulate and analyze large datasets using pandas.

  • These are just a few examples of the many datasets that are suitable for learning pandas. The best dataset for you will depend on your interests and the specific skills you want to develop.

    Follow up question, "where can I find Titanic data set?"

  • The Titanic dataset is a popular dataset for data analysis and machine learning, and it can be found on various websites and data repositories. Here are a few places where you can find the Titanic dataset:

    • Kaggle: The Titanic dataset is available on Kaggle, which is a popular platform for data scientists and machine learning enthusiasts. You can download the dataset from the Kaggle website after creating an account.

    • UCI Machine Learning Repository: The Titanic dataset is also available on the UCI Machine Learning Repository, which is a collection of datasets that are commonly used for machine learning research. You can download the dataset from the UCI Machine Learning Repository website.

    • Seaborn library: If you have the Seaborn library installed in your Python environment, you can load the Titanic dataset directly from the library using the following code:

      import seaborn as sns
      titanic_data = sns.load_dataset('titanic')
      

Titanic Data

Look at a sample of data.

import seaborn as sns

# Load the titanic dataset
titanic_data = sns.load_dataset('titanic')

print("Titanic Data")


print(titanic_data.columns) # titanic data set

print(titanic_data[['survived','pclass', 'sex', 'age', 'sibsp', 'parch', 'class', 'fare', 'embark_town']]) # look at selected columns

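Use Pandas to clean the data. Most analysis tools, including Machine Learning libraries and Pandas itself, work best when data is in a standardized format. Getting the data into that format is called 'Cleaning' or 'Preprocessing'. 'Training' comes later, when a model is fit to the cleaned data.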

# Preprocess the data
from sklearn.preprocessing import OneHotEncoder


td = titanic_data
td.drop(['alive', 'who', 'adult_male', 'class', 'embark_town', 'deck'], axis=1, inplace=True)
td.dropna(inplace=True)
td['sex'] = td['sex'].apply(lambda x: 1 if x == 'male' else 0)
td['alone'] = td['alone'].apply(lambda x: 1 if x == True else 0)

# Encode categorical variables
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(td[['embarked']])
onehot = enc.transform(td[['embarked']]).toarray()
cols = ['embarked_' + val for val in enc.categories_[0]]
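# reuse td's index: after dropna() the row labels have gaps, and a fresh 0..n-1
# index would misalign the one-hot rows and leave NaN values behind
td[cols] = pd.DataFrame(onehot, index=td.index)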
td.drop(['embarked'], axis=1, inplace=True)
td.dropna(inplace=True)

print(td)
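The one-hot encoding step can also be sketched with `pd.get_dummies`, which is a swapped-in alternative to the `OneHotEncoder` used in the cell above, not the lesson's method. The tiny frame below is made up for illustration; its index has gaps on purpose, the way an index does after `dropna()`, to show that the encoded columns stay aligned with their rows.

```python
import pandas as pd

# Made-up example frame; the gapped index imitates rows removed by dropna()
ports = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]}, index=[0, 1, 3, 7])

# One new 0/1 column per category, named embarked_<value>
dummies = pd.get_dummies(ports["embarked"], prefix="embarked", dtype=int)

# join() matches rows by index label, so nothing shifts out of place
encoded = ports.join(dummies).drop(columns="embarked")
print(encoded)
```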

Cleaned data is easier to analyze and draw conclusions from. Looking at the Titanic data, once it is clean you would probably want to estimate each passenger's likely chance of survival.

This would involve analyzing various factors (such as age, gender, class, etc.) that may have affected a person's chances of survival, and using that information to make predictions about whether an individual would have survived or not.

  • Data description:
    • Survival - Survival (0 = No; 1 = Yes). Not included in test.csv file.
    • Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
    • Name - Name
    • Sex - Sex
    • Age - Age
    • Sibsp - Number of Siblings/Spouses Aboard
    • Parch - Number of Parents/Children Aboard
    • Ticket - Ticket Number
    • Fare - Passenger Fare
    • Cabin - Cabin
    • Embarked - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
  • Perished Mean/Average

print(td.query("survived == 0").mean())
  • Survived Mean/Average
print(td.query("survived == 1").mean())

Survived Max and Min Stats

print(td.query("survived == 1").max())
print(td.query("survived == 1").min())

Machine Learning

From Tutorials Point: Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python.

Description from ChatGPT: The Titanic dataset is a popular dataset for data analysis and machine learning. In the context of machine learning, accuracy refers to the percentage of correctly classified instances in a set of predictions. In this case, the testing data is a subset of the original Titanic dataset that the decision tree model has not seen during training...

After training the decision tree model on the training data, we can evaluate its performance on the testing data by making predictions on the testing data and comparing them to the actual outcomes. The accuracy of the decision tree classifier on the testing data tells us how well the model generalizes to new data that it hasn't seen before...

For example, if the accuracy of the decision tree classifier on the testing data is 0.8 (or 80%), this means that 80% of the predictions made by the model on the testing data were correct... Chance of survival could be done using various machine learning techniques, including decision trees, logistic regression, or support vector machines, among others.
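The accuracy described above can be worked out by hand. In this sketch the two lists are made-up labels chosen only to reproduce the 0.8 example; they are not Titanic results.

```python
# Made-up labels for ten test passengers (1 = survived, 0 = perished)
actual    = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

# Count the positions where the prediction matches the actual outcome
correct = sum(a == p for a, p in zip(actual, predicted))

# Accuracy is the fraction of predictions that were correct
accuracy = correct / len(actual)
print(correct, "of", len(actual), "correct, accuracy =", accuracy)
```

This is the same quantity that `accuracy_score(y_test, y_pred)` from scikit-learn reports.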

  • The code below splits the cleaned data into training and testing sets, trains two models, and reports an accuracy for each. A natural next step would be to insert a new passenger and predict survival. The same approach could be applied to other data sets, such as predicting whether a player will hit a Home Run or whether a Stock will go up or down.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Split arrays or matrices into random train and test subsets.
X = td.drop('survived', axis=1)
y = td['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a decision tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Test the model
y_pred = dt.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('DecisionTreeClassifier Accuracy:', accuracy)

# Train a logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Test the model
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('LogisticRegression Accuracy:', accuracy)
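The idea of inserting a new passenger and predicting survival can be sketched as follows. To stay self-contained, this sketch trains on eight invented rows, not real Titanic records, with only three columns; the names `toy`, `model`, and `new_passenger` are hypothetical. With the real data, the new passenger would need every column that `X` has, in the same order.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented toy rows for illustration only (sex: 1 = male, 0 = female)
toy = pd.DataFrame({
    "pclass":   [1, 3, 2, 3, 1, 3, 2, 1],
    "sex":      [0, 1, 0, 1, 0, 1, 1, 0],
    "age":      [29, 40, 35, 22, 58, 19, 31, 24],
    "survived": [1, 0, 1, 0, 1, 0, 0, 1],
})

toy_X = toy.drop("survived", axis=1)
toy_y = toy["survived"]

# Fit a decision tree on the toy data
model = DecisionTreeClassifier(random_state=42)
model.fit(toy_X, toy_y)

# A new passenger is a one-row DataFrame with the same columns as toy_X
new_passenger = pd.DataFrame([{"pclass": 2, "sex": 0, "age": 27}])

prediction = model.predict(new_passenger)[0]         # 0 or 1
probability = model.predict_proba(new_passenger)[0]  # [P(perished), P(survived)]
print("Predicted survived:", prediction)
print("Class probabilities:", probability)
```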

My Hacks

2.3 Questions

  1. A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

  • Last name
  • First name
  • Grade level (9, 10, 11, or 12)
  • Grade point average (on a 0.0 to 4.0 scale)

The researcher also has access to another database with the following information about each student.

  • First name
  • Last name
  • Number of absences from school
  • Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

Answer: Many students in the district will share the same first and last name. Without a unique student ID, students with the same name may be confused with each other, so one student's GPA could be matched with another student's absences. That would make the researcher's data and analysis incorrect.
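The problem can be seen by merging two small tables on name alone. The students below are invented for illustration: two different students are both named Alex Smith.

```python
import pandas as pd

# Invented records: two different students share the name Alex Smith
grades = pd.DataFrame({
    "last":        ["Smith", "Smith", "Lee"],
    "first":       ["Alex", "Alex", "Dana"],
    "grade_level": [10, 12, 11],
    "gpa":         [3.2, 3.9, 3.5],
})
absences = pd.DataFrame({
    "first":    ["Alex", "Alex", "Dana"],
    "last":     ["Smith", "Smith", "Lee"],
    "absences": [2, 15, 1],
})

# Without a unique ID, the only possible join key is the name
merged = grades.merge(absences, on=["first", "last"])
print(merged)

# Each Alex Smith is paired with both absence records,
# so 3 students turn into 5 rows and two of the pairings are wrong
print(len(grades), "students became", len(merged), "rows")
```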

  1. A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

Answer: Combining multiple data sets requires them to have similar structures and measurements. One county may report pollution in different units (per acre vs. per square mile) than another county. County data sets may also differ in how often data is recorded: one county may have pollution data for every year, while another county may conduct that research monthly.

  1. A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Answer: Any time users are able to type manually into a field, there is a chance the data collected from that field has "garbage" in it. Users may spell the city name wrong, use an abbreviation, or use a city nickname. A more user-friendly program would plan for cases like these by, for example, storing not only city names in its data but also city abbreviations.
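Planning for abbreviations and nicknames can be sketched with a lookup table. The alias table and the function name `normalize_city` are hypothetical, and a misspelling is reported as unrecognized rather than guessed at.

```python
# Hypothetical alias table: lowercase spellings, abbreviations, and nicknames
# all map to one canonical city name
CITY_ALIASES = {
    "san diego": "San Diego",
    "sd": "San Diego",
    "los angeles": "Los Angeles",
    "la": "Los Angeles",
    "new york": "New York",
    "nyc": "New York",
    "the big apple": "New York",
}

def normalize_city(raw):
    """Return the canonical city name, or None if the input is not recognized."""
    # Trim the ends, lowercase, and collapse repeated spaces before the lookup
    key = " ".join(raw.strip().lower().split())
    return CITY_ALIASES.get(key)

for entry in ["  LA ", "the  Big Apple", "Sna Diego"]:
    print(repr(entry), "->", normalize_city(entry))
```

Returning None for an unrecognized entry lets the program ask the user to try again instead of storing garbage.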

  1. A database of information about shows at a concert venue contains the following information.

  • Name of artist performing at the show
  • Date of show
  • Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

Answer: The average price of tickets would be the most helpful in determining the artist with the greatest attendance. The attendance for a specific show can be estimated by dividing the total dollar amount of tickets sold by the average ticket price, which gives the number of people who bought tickets for that show.
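The division described in the answer can be worked through in Pandas. The shows, prices, and column names below are invented for illustration; note that the artist with the larger dollar total is not the one with the larger attendance.

```python
import pandas as pd

# Invented shows for one month; avg_ticket_price is the additional piece of information
shows = pd.DataFrame({
    "artist": ["Artist A", "Artist A", "Artist B"],
    "date": pd.to_datetime(["2023-03-04", "2023-03-18", "2023-03-11"]),
    "total_sales": [50000.0, 30000.0, 90000.0],
    "avg_ticket_price": [50.0, 50.0, 150.0],
})

# Estimated attendance = total dollar amount / average ticket price
shows["attendance"] = shows["total_sales"] / shows["avg_ticket_price"]

# Keep one particular month, then total the attendance per artist
march = shows[shows["date"].dt.month == 3]
by_artist = march.groupby("artist")["attendance"].sum()
print(by_artist)

top_artist = by_artist.idxmax()
print("Greatest attendance in March:", top_artist)
```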

  1. A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

Answer: Only the images would show the number of bicycles the car passed. Neither the car’s speed, the date and time, nor the car’s GPS location would indicate the number of bicycles the car passed.

  1. A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

  • On average, how long does homework take you each night (in minutes)?
  • On average, how long do you study for each test (in minutes)?
  • Do you enjoy the subject material of this class (yes or no)?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

  I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?
  II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?
  III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

Answer: Option III requires data from outside the survey, specifically the students' grades. Both I and II require only data collected by the survey, so the teacher can answer I and II.

1990 vs. 2017 Carbon Emissions per Nation

Purpose:

The purpose of this dataset is to compare the carbon emissions of nations in 2017 with those in 1990. It allows people to see which nations increased their emissions, which decreased them, which nations are the greatest contributors... I found a dataset on Kaggle that had CO2 emissions for the years 1990 through 2018. I created a JSON file from it, but only used the "Country", "1990 Carbon Emissions", and "2017 Carbon Emissions" columns.

import pandas as pd
df = pd.read_json('files/co2em.json')
print(df)

#print raw
                 Country  Emissions1990  Emissions2017
0            Afghanistan       0.191745       0.131695
1                 Angola       0.553662       0.813301
2                Albania       1.819542       1.788786
3                Andorra       7.521832       6.104134
4   United Arab Emirates      30.195189      20.769022
..                   ...            ...            ...
89               Czechia      14.540000      10.230000
90           South Korea       5.777563      12.710000
91            Kazakhstan      14.512479      17.520000
92               Iceland       7.652280      10.480000
93          Turkmenistan      12.120051      12.290000

[94 rows x 3 columns]
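The increase or decrease between the two years can be computed as a new column. This sketch copies the first five rows of the output above into a small frame instead of reading 'files/co2em.json'; the column names `Change` and `Trend` are added here for illustration.

```python
import pandas as pd

# First five rows copied from the printed output above
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Angola", "Albania", "Andorra", "United Arab Emirates"],
    "Emissions1990": [0.191745, 0.553662, 1.819542, 7.521832, 30.195189],
    "Emissions2017": [0.131695, 0.813301, 1.788786, 6.104134, 20.769022],
})

# Positive change means emissions went up between 1990 and 2017
sample["Change"] = sample["Emissions2017"] - sample["Emissions1990"]
sample["Trend"] = sample["Change"].apply(lambda c: "increased" if c > 0 else "decreased")
print(sample[["Country", "Change", "Trend"]].to_string(index=False))

# Countries in this sample whose emissions increased, and the largest decrease
increased = sample[sample["Change"] > 0]["Country"].tolist()
biggest_drop = sample.loc[sample["Change"].idxmin(), "Country"]
print("Increased:", increased)
print("Largest decrease:", biggest_drop)
```

The same two lines that build `Change` and `Trend` would work on the full `df` once it is loaded.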
print(df[['Country','Emissions1990','Emissions2017']].to_string(index=False))

#print data, but dont show index values
                 Country  Emissions1990  Emissions2017
             Afghanistan       0.191745       0.131695
                  Angola       0.553662       0.813301
                 Albania       1.819542       1.788786
                 Andorra       7.521832       6.104134
    United Arab Emirates      30.195189      20.769022
               Argentina       3.075541       4.071308
                 Armenia       5.610254       1.823560
          American Samoa       0.540000       1.490000
     Antigua and Barbuda       3.358227       5.239717
               Australia      15.448488      15.818316
                 Austria       7.589364       7.486162
              Azerbaijan       7.453451       3.243342
                 Burundi       0.031256       0.048952
                 Belgium      10.966775       8.151096
                   Benin       0.066285       0.614755
            Burkina Faso       0.056747       0.222474
              Bangladesh       0.111658       0.496852
                Bulgaria       8.443170       6.201290
                 Bahrain      21.656413      20.427328
                 Bahamas       7.649467       5.684363
  Bosnia and Herzegovina       5.370319       6.767051
                 Belarus       9.797487       5.921351
                  Belize       1.759493       1.703147
                Bermunda       8.680000       6.990000
                 Bolivia       1.516423       2.032547
                  Brazil       1.328159       2.196418
                Barbados       3.755753       4.122573
       Brunei Barussalam      12.600787      21.640000
                  Bhutan       0.150716       1.301030
                Botswana       2.183786       3.324148
Central African Republic       0.053443       0.047867
                  Canada      15.148890      15.544571
             Switzerland       6.452219       4.578885
                   Chile       2.299125       4.714020
                   China       1.914543       7.230985
           Cote d'Ivoire       0.227256       0.432123
                Cameroon       0.246178       0.373279
        Congo, Dem. Rep.       0.091876       0.035013
             Congo, Rep.       1.357808       1.058563
                Colombia       1.472091       1.533229
                 Comoros       0.170069       0.344027
              Cabo Verde       0.295899       1.097676
              Costa Rica       0.920038       1.668702
                    Cuba       3.361333       2.314085
          Cayman Islands       8.870000       2.310000
                  Cyprus       5.987352       6.230000
          Czech Republic      14.535453       6.238954
                 Germany      12.026609       8.858294
                Djibouti       0.457322       0.444868
                Dominica       0.852006       2.378953
                 Denmark      10.132390       5.746768
      Dominican Republic       1.101845       2.240060
                 Algeria       2.443430       3.825638
                 Ecuador       1.507194       2.296645
        Egypt, Arab Rep.       1.563210       2.511650
                 Eritrea       0.194898       0.210965
                   Spain       5.530353       5.654254
                 Estonia      22.483166      12.684228
                Ethiopia       0.048238       0.148684
                 Finland      10.941694       7.810154
                  Mexico       3.211460       3.790000
             New Zealand       6.700102       7.520000
                    Fiji       0.947054       1.880428
                  France       6.117208       4.749543
           Faroe Islands       0.030000       1.614915
   Micronesia, Fed. Sts.       1.270938       1.614915
                   Gabon       4.318094       2.581349
          United Kingdom       9.812990       5.555652
                 Georgia       6.974177       2.609976
                   Ghana       0.188855       0.584105
               Gibraltar       4.930000       0.124352
                  Guinea       0.157424       0.276776
                  Gambia       0.198829       0.243913
           Guinea-Bissau       0.174312       0.169571
      Equaatorial Guinea       0.143134       4.595851
                  Greece       7.416058       6.211250
                 Grenada       1.141932       2.615582
                  Russia      14.620000      11.370000
               Guatemala       0.434249       1.029997
                   Italy       7.145018       5.440000
                   India       0.645362       1.733361
           United States      19.407346      14.823261
                   Qatar      26.329327      32.127990
               Greenland       0.020000       9.690000
            Saudi Arabia      10.558843      18.820000
                 Somalia       0.101037       0.040000
                  Kuwait      13.902212      23.290000
     Trinidad and Tobago      10.768794      27.270000
                    Oman       6.577793      15.110000
                 Czechia      14.540000      10.230000
             South Korea       5.777563      12.710000
              Kazakhstan      14.512479      17.520000
                 Iceland       7.652280      10.480000
            Turkmenistan      12.120051      12.290000
print(df[['Emissions2017']])

print()

#try 3 columns and remove the index from print statement
print(df[['Emissions1990','Emissions2017','Country']].to_string(index=False))
    Emissions2017
0        0.131695
1        0.813301
2        1.788786
3        6.104134
4       20.769022
..            ...
89      10.230000
90      12.710000
91      17.520000
92      10.480000
93      12.290000

[94 rows x 1 columns]

 Emissions1990  Emissions2017                  Country
      0.191745       0.131695              Afghanistan
      0.553662       0.813301                   Angola
      1.819542       1.788786                  Albania
      7.521832       6.104134                  Andorra
     30.195189      20.769022     United Arab Emirates
      3.075541       4.071308                Argentina
      5.610254       1.823560                  Armenia
      0.540000       1.490000           American Samoa
      3.358227       5.239717      Antigua and Barbuda
     15.448488      15.818316                Australia
      7.589364       7.486162                  Austria
      7.453451       3.243342               Azerbaijan
      0.031256       0.048952                  Burundi
     10.966775       8.151096                  Belgium
      0.066285       0.614755                    Benin
      0.056747       0.222474             Burkina Faso
      0.111658       0.496852               Bangladesh
      8.443170       6.201290                 Bulgaria
     21.656413      20.427328                  Bahrain
      7.649467       5.684363                  Bahamas
      5.370319       6.767051   Bosnia and Herzegovina
      9.797487       5.921351                  Belarus
      1.759493       1.703147                   Belize
      8.680000       6.990000                 Bermunda
      1.516423       2.032547                  Bolivia
      1.328159       2.196418                   Brazil
      3.755753       4.122573                 Barbados
     12.600787      21.640000        Brunei Barussalam
      0.150716       1.301030                   Bhutan
      2.183786       3.324148                 Botswana
      0.053443       0.047867 Central African Republic
     15.148890      15.544571                   Canada
      6.452219       4.578885              Switzerland
      2.299125       4.714020                    Chile
      1.914543       7.230985                    China
      0.227256       0.432123            Cote d'Ivoire
      0.246178       0.373279                 Cameroon
      0.091876       0.035013         Congo, Dem. Rep.
      1.357808       1.058563              Congo, Rep.
      1.472091       1.533229                 Colombia
      0.170069       0.344027                  Comoros
      0.295899       1.097676               Cabo Verde
      0.920038       1.668702               Costa Rica
      3.361333       2.314085                     Cuba
      8.870000       2.310000           Cayman Islands
      5.987352       6.230000                   Cyprus
     14.535453       6.238954           Czech Republic
     12.026609       8.858294                  Germany
      0.457322       0.444868                 Djibouti
      0.852006       2.378953                 Dominica
     10.132390       5.746768                  Denmark
      1.101845       2.240060       Dominican Republic
      2.443430       3.825638                  Algeria
      1.507194       2.296645                  Ecuador
      1.563210       2.511650         Egypt, Arab Rep.
      0.194898       0.210965                  Eritrea
      5.530353       5.654254                    Spain
     22.483166      12.684228                  Estonia
      0.048238       0.148684                 Ethiopia
     10.941694       7.810154                  Finland
      3.211460       3.790000                   Mexico
      6.700102       7.520000              New Zealand
      0.947054       1.880428                     Fiji
      6.117208       4.749543                   France
      0.030000       1.614915            Faroe Islands
      1.270938       1.614915    Micronesia, Fed. Sts.
      4.318094       2.581349                    Gabon
      9.812990       5.555652           United Kingdom
      6.974177       2.609976                  Georgia
      0.188855       0.584105                    Ghana
      4.930000       0.124352                Gibraltar
      0.157424       0.276776                   Guinea
      0.198829       0.243913                   Gambia
      0.174312       0.169571            Guinea-Bissau
      0.143134       4.595851       Equaatorial Guinea
      7.416058       6.211250                   Greece
      1.141932       2.615582                  Grenada
     14.620000      11.370000                   Russia
      0.434249       1.029997                Guatemala
      7.145018       5.440000                    Italy
      0.645362       1.733361                    India
     19.407346      14.823261            United States
     26.329327      32.127990                    Qatar
      0.020000       9.690000                Greenland
     10.558843      18.820000             Saudi Arabia
      0.101037       0.040000                  Somalia
     13.902212      23.290000                   Kuwait
     10.768794      27.270000      Trinidad and Tobago
      6.577793      15.110000                     Oman
     14.540000      10.230000                  Czechia
      5.777563      12.710000              South Korea
     14.512479      17.520000               Kazakhstan
      7.652280      10.480000                  Iceland
     12.120051      12.290000             Turkmenistan
# Sort rows from greatest to least by 1990 emissions
print(df.sort_values(by=['Emissions1990'], ascending=False))

print()

# Sort rows from greatest to least by 2017 emissions
print(df.sort_values(by=['Emissions2017'], ascending=False))

print()
                     Country  Emissions1990  Emissions2017
4       United Arab Emirates      30.195189      20.769022
82                     Qatar      26.329327      32.127990
57                   Estonia      22.483166      12.684228
18                   Bahrain      21.656413      20.427328
81             United States      19.407346      14.823261
..                       ...            ...            ...
30  Central African Republic       0.053443       0.047867
58                  Ethiopia       0.048238       0.148684
12                   Burundi       0.031256       0.048952
64             Faroe Islands       0.030000       1.614915
83                 Greenland       0.020000       9.690000

[94 rows x 3 columns]

                     Country  Emissions1990  Emissions2017
82                     Qatar      26.329327      32.127990
87       Trinidad and Tobago      10.768794      27.270000
86                    Kuwait      13.902212      23.290000
27         Brunei Barussalam      12.600787      21.640000
4       United Arab Emirates      30.195189      20.769022
..                       ...            ...            ...
70                 Gibraltar       4.930000       0.124352
12                   Burundi       0.031256       0.048952
30  Central African Republic       0.053443       0.047867
85                   Somalia       0.101037       0.040000
37          Congo, Dem. Rep.       0.091876       0.035013

[94 rows x 3 columns]
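
When only the few highest or lowest rows matter, sorting the whole table is more than is needed. The sketch below shows one possible alternative using `nlargest` and `nsmallest`. The `sample` DataFrame is a hypothetical five-row excerpt built by hand from values printed above, not the full data set used in the lesson.

```python
import pandas as pd

# Hypothetical excerpt: five rows copied from the output above
sample = pd.DataFrame({
    "Country": ["Qatar", "Estonia", "United States", "Greenland", "Somalia"],
    "Emissions1990": [26.329327, 22.483166, 19.407346, 0.020000, 0.101037],
    "Emissions2017": [32.127990, 12.684228, 14.823261, 9.690000, 0.040000],
})

# Three largest 1990 values, returned in descending order
top3_1990 = sample.nlargest(3, "Emissions1990")

# Two smallest 2017 values, returned in ascending order
bottom2_2017 = sample.nsmallest(2, "Emissions2017")

print(top3_1990)
print()
print(bottom2_2017)
```

Both calls return a new DataFrame and leave `sample` unchanged, just as `sort_values` does without `inplace=True`.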

# Keep only the rows where emissions exceed 20 metric tons per capita
print(df[df.Emissions1990 > 20.00])
print()
print(df[df.Emissions2017 > 20.00])
                 Country  Emissions1990  Emissions2017
4   United Arab Emirates      30.195189      20.769022
18               Bahrain      21.656413      20.427328
57               Estonia      22.483166      12.684228
82                 Qatar      26.329327      32.127990

                 Country  Emissions1990  Emissions2017
4   United Arab Emirates      30.195189      20.769022
18               Bahrain      21.656413      20.427328
27     Brunei Barussalam      12.600787      21.640000
82                 Qatar      26.329327      32.127990
86                Kuwait      13.902212      23.290000
87   Trinidad and Tobago      10.768794      27.270000
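
The two filters above can also be combined into one boolean mask. This is a minimal sketch, assuming a hypothetical `sample` DataFrame built by hand from five of the rows printed above. Each condition needs its own parentheses, and pandas uses `&` (not `and`) to combine masks.

```python
import pandas as pd

# Hypothetical excerpt: five rows copied from the output above
sample = pd.DataFrame({
    "Country": ["United Arab Emirates", "Bahrain", "Estonia", "Qatar", "Kuwait"],
    "Emissions1990": [30.195189, 21.656413, 22.483166, 26.329327, 13.902212],
    "Emissions2017": [20.769022, 20.427328, 12.684228, 32.127990, 23.290000],
})

# Rows that exceed 20 metric tons per capita in BOTH years
both_high = sample[(sample["Emissions1990"] > 20.00) & (sample["Emissions2017"] > 20.00)]

print(both_high)
```

In this excerpt, Estonia drops out because its 2017 value fell below 20, and Kuwait drops out because its 1990 value was below 20.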
print("------Here are the nations with the most and least CO2 Emissions in 1990------")
print(df[df.Emissions1990 == df.Emissions1990.max()])
print()
print(df[df.Emissions1990 == df.Emissions1990.min()])
print()


print("------Here are the nations with the most and least CO2 Emissions in 2017-----")
print(df[df.Emissions2017 == df.Emissions2017.max()])
print()
print(df[df.Emissions2017 == df.Emissions2017.min()])
------Here are the nations with the most and least CO2 Emissions in 1990------
                Country  Emissions1990  Emissions2017
4  United Arab Emirates      30.195189      20.769022

      Country  Emissions1990  Emissions2017
83  Greenland           0.02           9.69

------Here are the nations with the most and least CO2 Emissions in 2017-----
   Country  Emissions1990  Emissions2017
82   Qatar      26.329327       32.12799

             Country  Emissions1990  Emissions2017
37  Congo, Dem. Rep.       0.091876       0.035013
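
Comparing a column against its own `max()` or `min()` returns every row that ties for the extreme value. If a single country name is all that is needed, `idxmax` and `idxmin` give the index label of the first extreme row, which can then be passed to `.loc`. The sketch below assumes a hypothetical `sample` DataFrame built by hand from four of the rows printed above.

```python
import pandas as pd

# Hypothetical excerpt: four rows copied from the output above
sample = pd.DataFrame({
    "Country": ["United Arab Emirates", "Qatar", "Greenland", "Congo, Dem. Rep."],
    "Emissions1990": [30.195189, 26.329327, 0.020000, 0.091876],
    "Emissions2017": [20.769022, 32.127990, 9.690000, 0.035013],
})

# idxmax/idxmin return the index label of the first max/min row
most_1990 = sample.loc[sample["Emissions1990"].idxmax(), "Country"]
least_1990 = sample.loc[sample["Emissions1990"].idxmin(), "Country"]
most_2017 = sample.loc[sample["Emissions2017"].idxmax(), "Country"]
least_2017 = sample.loc[sample["Emissions2017"].idxmin(), "Country"]

print("1990 most:", most_1990, "| least:", least_1990)
print("2017 most:", most_2017, "| least:", least_2017)
```

Unlike the equality filter, this approach reports only the first match, so it would hide ties if two countries shared the same extreme value.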
# Mean of the 1990 column across all nations
mean = df["Emissions1990"].mean()
print("Average 1990 CO2 emissions per nation is", mean, "metric tons per capita")

# Mean of the 2017 column across all nations
mean = df["Emissions2017"].mean()
print("Average 2017 CO2 emissions per nation is", mean, "metric tons per capita")
Average 1990 CO2 emissions per nation is 5.649072744255316 metric tons per capita
Average 2017 CO2 emissions per nation is 5.828730248755319 metric tons per capita
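
The averages above print with many decimal places. One way to tidy this up is to compute both column means in a single call and round the result. This is a sketch only, using a hypothetical two-row `sample` DataFrame built by hand from values printed earlier, so its averages differ from the full data set's.

```python
import pandas as pd

# Hypothetical excerpt: two rows copied from earlier output
sample = pd.DataFrame({
    "Country": ["Qatar", "Greenland"],
    "Emissions1990": [26.329327, 0.020000],
    "Emissions2017": [32.127990, 9.690000],
})

# Selecting both columns and calling mean() returns a Series of column means
means = sample[["Emissions1990", "Emissions2017"]].mean().round(2)

for column, value in means.items():
    print(f"Average {column}: {value:.2f} metric tons per capita")
```

Rounding only changes how the numbers are displayed here. Keep the unrounded values if they feed into later calculations.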
sum2017 = sum(df['Emissions2017'])
sum1990 = sum(df['Emissions1990'])