Reads data from an open data file.
•File ID is a numeric expression that specifies the file ID of an open data file. See Open Data File script step.
•Amount specifies the number of bytes to read if Read as is UTF-8 or Bytes. For UTF-16, Amount is the number of Unicode code units to read. If Amount is not specified (set to nothing), the entire file is read.
•Target is the field or variable in which to store the data read from the file.
•Read as specifies the character encoding of the file:
•UTF-16 is FileMaker's default 16-bit Unicode (UTF-16 little endian) character encoding.
•UTF-8 is 8-bit Unicode character encoding.
•Bytes assumes no specific character encoding and instead reads one byte at a time.

If the target is a variable or container field, data read from the file is stored as container data with the same filename as the specified file.
FileMaker Pro 18.0 Advanced

This script step reads the amount of data specified by Amount starting at the current read-write position, or it reads the entire file if Amount is not specified. See Set Data File Position script step.

The maximum amount of data this script step can read at a time is 64 MB. If a file is larger than 64 MB, you can perform this script step multiple times, reading an amount no larger than 64 MB each time.

•Performance is best when reading no more than 64 KB at a time.
•When reading Unicode text from a portion of a file, only part of a character may be read if the character consists of multiple code units. If you expect to read text from a UTF-8 or UTF-16 file, it may be safer to read the entire file in one operation unless you are sure of the number of code units per character.

Reads the contents of an open file with a file ID of 2 and stores the data in a variable.

Read from Data File [ File ID: 2 ; Amount (bytes): ; Target: $variable ; Read as: Bytes ]

Checks whether a file named change.log exists in the Documents folder. If it does, reads the first 100 UTF-8 code units into the Utilities::Log Entry text field.

Set Variable [ $file ; Value: Get ( DocumentsPath ) & "change.log" ]
Get File Exists [ "$file" ; Target: $fileExists ]
If [ not $fileExists ]
    Exit Script [ Text Result: ]
End If
Open Data File [ "$file" ; Target: $fileID ]
Read from Data File [ File ID: $fileID ; Amount (bytes): 100 ; Target: Utilities::Log Entry ; Read as: UTF-8 ]
Close Data File [ File ID: $fileID ]

If the large.log file exists and is large, reads the file 64 MB at a time into a variable until the entire file is read. If the file is smaller, reads the entire file in one read operation. Data is read into a variable for improved performance, then the data is stored in a field.
Set Error Capture [ On ]
Set Variable [ $k_FileMissingError ; Value: 100 ]
Set Variable [ $k_EndOfFileError ; Value: 10 ]
Set Variable [ $k_64KB ; Value: 64 * 1024 ]
Set Variable [ $k_64MB ; Value: 64 * 1024 * 1024 ]
Set Variable [ $file ; Value: "large.log" ]
Get File Exists [ "$file" ; Target: $fileExists ]
If [ $fileExists = 0 ]
    Exit Script [ Result: $k_FileMissingError ]
End If
Open Data File [ "$file" ; Target: $fileID ]
# If the file opened successfully, read the contents.
If [ Get ( LastError ) = 0 ]
    Get File Size [ "$file" ; Target: $fileSize ]
    # If the file size is greater than 64 KB, read it 64 MB at a time.
    If [ $fileSize > $k_64KB ]
        Loop
            # Read up to 64 MB and store it in a variable.
            Read from Data File [ File ID: $fileID ; Amount (bytes): $k_64MB ; Target: $dataChunk ; Read as: UTF-8 ]
            Set Variable [ $readError ; Value: Get ( LastError ) ]
            # If the read operation was successful or if the end of the file was reached, concatenate the data read in this pass ($dataChunk) with the data read previously ($fileContents).
            If [ ( $readError = 0 ) or ( $readError = $k_EndOfFileError ) ]
                Set Variable [ $fileContents ; Value: $fileContents & $dataChunk ]
            End If
            # Exit the loop if the read operation failed or the end of the file was reached.
            Exit Loop If [ $readError ]
        End Loop
    Else
        # If the file is no larger than 64 KB, read it all at once.
        Read from Data File [ File ID: $fileID ; Target: $fileContents ; Read as: UTF-8 ]
    End If
    # Close the data file and store the contents in a field.
    Close Data File [ File ID: $fileID ]
    Set Field [ Table::Data ; $fileContents ]
End If
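The same chunked-read pattern can be sketched outside FileMaker. The Python sketch below is an illustration only (the function name, chunk size, and test file are assumptions, not part of the script step): read a fixed-size chunk per pass, stop when a read returns nothing, and accumulate the pieces.

```python
import os
import tempfile

# Read a file in fixed-size chunks so memory use stays bounded,
# mirroring the "read no more than 64 MB per pass" loop above.
# 64 KB chunks here; the size is an arbitrary choice for the sketch.
def read_in_chunks(path, chunk_size=64 * 1024):
    parts = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # returns b"" once the end of file is reached
            if not chunk:
                break
            parts.append(chunk)
    return b"".join(parts)

# Demonstrate on a throwaway file larger than one chunk.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"x" * (200 * 1024))
    path = tf.name
data = read_in_chunks(path)
os.unlink(path)
print(len(data))  # 204800
```

The loop termination mirrors the script's end-of-file error check: an empty read plays the role of error 10.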
In all of our examples so far, we have directly assigned values to variables in our code before using them. Python has an input() function which will ask for input from the user, but for any large amount of data this is an impractical way of collecting it. In reality, most of the data that your program uses will be read from a file, and, apart from when you are developing, most of your program's output will be written to a file. In this episode we will look at how to read and write data files in Python. There are in fact many different approaches to reading data files, and which one you choose will depend on such things as the size of the file and its data format. In this episode we will:
The file we will be using is only a small file (131 data records), but the approach we are using will work for a file of any size: imagine 131M records. This is because we only process one record at a time, so the memory requirements of the program are very small; the larger the file, the more processing time is required. Other approaches to reading files typically read the whole file in one go. This can be efficient when you subsequently process the data, but it requires large amounts of memory to hold the entire file. We will look at this approach later when we look at the Pandas package.

For our examples in this episode we are going to use the SAFI_results.csv file. This is available for download here and the description of the file is available here. The code assumes that the file is in the same directory as your notebook.

We will build up our programs in simple steps.

Step 1 - Open the file, read through it and close the file

with open("SAFI_results.csv") as f:   # Open the file and assign it to a new variable which we call 'f'.
                                      # The file will be read-only by default.
    # As long as the following code is indented, the file 'f' will be open.
    for line in f:       # We use a for loop to iterate through the file one line at a time.
        print(line)      # We simply print the line.

print("I'm on to something else now.")   # When we stop indenting the code, the file is closed automatically.
Column1,A01_interview_date,A03_quest_no,A04_start,A05_end,A06_province,A07_district,A08_ward,A09_village,A11_years_farm,A12_agr_assoc,B11_remittance_money,B16_years_liv,B17_parents_liv,B18_sp_parents_liv,B19_grand_liv,B20_sp_grand_liv,B_no_membrs,C01_respondent_roof_type,C02_respondent_wall_type,C02_respondent_wall_type_other,C03_respondent_floor_type,C04_window_type,C05_buildings_in_compound,C06_rooms,C07_other_buildings,D_plots_count,E01_water_use,E17_no_enough_water,E19_period_use,E20_exper_other,E21_other_meth,E23_memb_assoc,E24_resp_assoc,E25_fees_water,E26_affect_conflicts,E_no_group_count,E_yes_group_count,F04_need_money,F05_money_source_other,F06_crops_contr,F08_emply_lab,F09_du_labour,F10_liv_owned_other,F12_poultry,F13_du_look_aftr_cows,F_liv_count,G01_no_meals,_members_count,_note,gps:Accuracy,gps:Altitude,gps:Latitude,gps:Longitude,instanceID
0,17/11/2016,1,2017-03-23T09:49:57.000Z,2017-04-02T17:29:08.000Z,Province1,District1,Ward2,Village2,11,no,no,4,no,yes,no,yes,3,grass,muddaub,,earth,no,1,1,no,2,no,,,,,,,,,2,,,,,no,no,,yes,no,1,2,3,,14,698,-19.11225943,33.48345609,uuid:ec241f2c-0609-46ed-b5e8-fe575f6cefef
...

You can think of the file as being a list of strings. Each string in the list is one complete line from the file. If you look at the output, you can see that the first record in the file is a header record containing column names. When we read a file in this way the column headings have no significance; the computer sees them as just another record in the file.

Step 2 - Select a specific 'column' from the records in the file

We know that the first record in the file is a header record and we want to ignore it. To do this we call the readline() method of the file handle f. We don't need to assign the line that it returns to a variable, as we are not going to use it. As we read the file, the line variable is a string containing a complete record. The fields or columns of the record are separated from each other by "," as it is a csv file.
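One caveat worth knowing before we start splitting lines on ",": a naive split breaks any field that itself contains a comma inside quotes. Python's csv module applies the CSV quoting rules for us. The row below is a made-up example, not a record from SAFI_results.csv:

```python
import csv
import io

# line.split(",") mis-handles quoted fields that contain commas;
# csv.reader is quote-aware and returns the intended fields.
row = 'grass,"a note, with a comma",2\n'
print(row.split(","))                       # naive split: 4 pieces, the quoted field is broken in two
print(next(csv.reader(io.StringIO(row))))   # csv.reader: 3 fields, the quoted field kept whole
```

The SAFI file happens to have no quoted commas, so plain split() works in this episode, but csv.reader is the safer general tool.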
As line is a string we can use the split() method to convert it to a list of column values. We are specifically going to select the column which is the 19th entry in the list (remember the list index starts at 0). This refers to the C01_respondent_roof_type column. We are going to examine the different roof types.

with open("SAFI_results.csv") as f:   # Open the file and assign it to a variable called 'f'.
    # Indent the code to keep the file open. Stop indenting to close.
    f.readline()   # First line is a header so ignore it.
    for line in f:
        print(line.split(",")[18])   # Index 18, the 19th column, is C01_respondent_roof_type.

grass
grass
mabatisloping
mabatisloping
grass
grass
grass
mabatisloping
...

Having a list of the roof types from all of the records is one thing, but it is more likely that we would want a count of each type. By scanning up and down the previous output, there appear to be 3 different types, but we will play safe and assume there may be more.

Step 3 - How many of each of the different roof types are there?

# 1
with open("SAFI_results.csv") as f:
    # 2
    f.readline()
    # 3
    grass_roof = 0
    mabatisloping_roof = 0
    mabatipitched_roof = 0
    roof_type_other = 0
    for line in f:
        # 4
        roof_type = line.split(",")[18]
        # 5
        if roof_type == 'grass':
            grass_roof += 1
        elif roof_type == 'mabatisloping':
            mabatisloping_roof += 1
        elif roof_type == 'mabatipitched':
            mabatipitched_roof += 1
        else:
            roof_type_other += 1

# 6
print("There are", grass_roof, "grass roofs")
print("There are", mabatisloping_roof, "mabatisloping roofs")
print("There are", mabatipitched_roof, "mabatipitched roofs")
print("There are", roof_type_other, "other roof types")

There are 73 grass roofs
There are 48 mabatisloping roofs
There are 10 mabatipitched roofs
There are 0 other roof types

What are we doing here?
Instead of printing out the counts of the roof types, you may want to extract all of one particular roof type to a separate file. Let us assume we want all of the grass roof records to be written to a file.

# 1
with open("SAFI_results.csv") as fr:   # Note how we have used a new variable name, 'fr'.
    # The file is read-only by default.
    with open("SAFI_grass_roof.csv", "w") as fw:
        # We are keeping 'fr' open so we indent.
        # We specify a second parameter, "w", to make this file writeable.
        # We use a different variable, 'fw'.
        for line in fr:
            # 2
            if line.split(",")[18] == 'grass':
                fw.write(line)

What are we doing here?
In this example we didn't bother skipping the header line, as it would fail the test in the if statement. If we did want to include it, we could have copied it to the output file with an extra write before the for loop.
In our example of printing the counts for the roof types, we assumed that we knew what the likely roof types were, although we did have an 'other' option to catch anything we missed. Had there been any, we would still be none the wiser as to what they represented. We were able to decide on the specific roof types by manually scanning the list of C01_respondent_roof_type values. This was only practical because of the small file size; for a multi-million record file we could not have done this. We would like a way of creating a list of the different roof types and at the same time counting them. We can do this by using not a Python list structure, but a Python dictionary.

The Python dictionary structure

In Python a dictionary object maps keys to values. A dictionary can hold any number of keys and values, but a key cannot be duplicated. The following code shows examples of creating a dictionary object and manipulating keys and values.

# an empty dictionary
myDict = {}

# A dictionary with a single key-value pair
personDict = {'Name': 'Peter'}

# I can add more about 'Peter' to the dictionary
personDict['Location'] = 'Manchester'

# I can print all of the keys and values from the dictionary
print(personDict.items())

# I can print all of the keys and values from the dictionary - and make it look a bit nicer
for item in personDict:
    print(item, "=", personDict[item])

# or all of the keys
print(personDict.keys())

# or all of the values
print(personDict.values())

# I can access the value for a given key
x = personDict['Name']
print(x)

# I can change the value for a given key
personDict['Name'] = "Fred"
print(personDict['Name'])

# I can check if a key exists
key = 'Name'
if key in personDict:
    print("already exists")
else:
    personDict[key] = "New value"

dict_items([('Location', 'Manchester'), ('Name', 'Peter')])
Location = Manchester
Name = Peter
dict_keys(['Location', 'Name'])
dict_values(['Manchester', 'Peter'])
Peter
Fred
already exists
We are now in a position to re-write our count of roof types example without knowing in advance what any of the roof types are.

# 1
with open("SAFI_results.csv") as f:
    # 2
    f.readline()
    # 3
    dict_roof_types = {}
    for line in f:
        # 4
        roof_type = line.split(",")[18]
        # 5
        if roof_type in dict_roof_types:
            dict_roof_types[roof_type] += 1
        else:
            dict_roof_types[roof_type] = 1

# 6
for item in dict_roof_types:
    print(item, "=", dict_roof_types[item])

grass = 73
mabatisloping = 48
mabatipitched = 10

What are we doing here?
You can apply the same approach to count values in any of the fields/columns of the file.
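The "add the key if it is new, otherwise increment its count" pattern is common enough that the standard library provides it ready-made in collections.Counter. A quick sketch with made-up roof-type values (not read from the file):

```python
from collections import Counter

# Counter does the "add key if new, else increment" logic for us.
# The values below are samples for illustration only.
roof_types = ["grass", "grass", "mabatisloping", "grass", "mabatipitched"]
counts = Counter(roof_types)
print(counts["grass"])        # 3
print(counts.most_common(1))  # [('grass', 3)]
```

A Counter behaves like a dictionary, so the final reporting loop in our example would work on it unchanged.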
Python can be very flexible in how it interprets 'strings' which you want to be considered as a date, a time, or a date and time, but you have to tell Python how the various parts of the date and/or time are represented in your 'string'. You can do this by creating a format. In a format, different case-sensitive characters preceded by the % character act as placeholders for parts of the date/time; for example, %Y represents a year formatted as a 4 digit number, such as 2014. A full list of the characters used and what they represent can be found towards the end of the datetime section of the official Python documentation.

There is a today() method which allows you to get the current date and time. By default it is displayed in a format similar to the ISO 8601 standard format. To use the date and time functions you need to import the datetime module.

from datetime import datetime

today = datetime.today()
print('ISO :', today)

ISO : 2018-04-12 16:19:17.177441

We can use our own formatting instead. For example, if we wanted words instead of numbers and the 4 digit year at the end, we could use the following.

format = "%a %b %d %H:%M:%S %Y"

today_str = today.strftime(format)
print('strftime:', today_str)
print(type(today_str))

today_date = datetime.strptime(today_str, format)
print('strptime:', today_date.strftime(format))
print(type(today_date))

strftime: Thu Apr 12 16:19:17 2018
<class 'str'>
strptime: Thu Apr 12 16:19:17 2018
<class 'datetime.datetime'>

strftime converts a datetime object to a string, and strptime creates a datetime object from a string. When you print them using the same format string, they look the same.

The format of the date fields in the SAFI_results.csv file has been generated automatically to conform to the ISO 8601 standard. When we read the file and extract the date fields, they are of type string. Before we can use them as dates, we need to convert them into Python date objects.
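To see the strftime/strptime round trip with reproducible output, we can use a fixed, made-up date instead of today() (the date chosen below is an assumption for illustration):

```python
from datetime import datetime

# Round-trip a fixed date through strftime (datetime -> str)
# and strptime (str -> datetime) using the same format string.
format = "%a %b %d %H:%M:%S %Y"
dt = datetime(2018, 4, 12, 16, 19, 17)
s = dt.strftime(format)
print(s)  # Thu Apr 12 16:19:17 2018
print(datetime.strptime(s, format) == dt)  # True: the round trip loses nothing
```

Note that %a and %b produce locale-dependent names; the output above assumes the default (English/C) locale.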
In the format string we use below, the '-', ':', 'T' and 'Z' characters are just that: characters in the string representing the date/time. Only the characters preceded by % have special meanings. Having converted the strings to datetime objects, there are a variety of methods that we can use to extract different components of the date/time.

from datetime import datetime

format = "%Y-%m-%dT%H:%M:%S.%fZ"

f = open('SAFI_results.csv', 'r')

# skip the header line
line = f.readline()

# next line has data
line = f.readline()

strdate_start = line.split(',')[3]   # A04_start
strdate_end = line.split(',')[4]     # A05_end
print(type(strdate_start), strdate_start)
print(type(strdate_end), strdate_end)

# the full date and time
datetime_start = datetime.strptime(strdate_start, format)
print(type(datetime_start))
datetime_end = datetime.strptime(strdate_end, format)
print('formatted date and time', datetime_start)
print('formatted date and time', datetime_end)

# the date component
date_start = datetime.strptime(strdate_start, format).date()
print(type(date_start))
date_end = datetime.strptime(strdate_end, format).date()
print('formatted start date', date_start)
print('formatted end date', date_end)

# the time component
time_start = datetime.strptime(strdate_start, format).time()
print(type(time_start))
time_end = datetime.strptime(strdate_end, format).time()
print('formatted start time', time_start)
print('formatted end time', time_end)

f.close()

<class 'str'> 2017-03-23T09:49:57.000Z
<class 'str'> 2017-04-02T17:29:08.000Z
<class 'datetime.datetime'>
formatted date and time 2017-03-23 09:49:57
formatted date and time 2017-04-02 17:29:08
<class 'datetime.date'>
formatted start date 2017-03-23
formatted end date 2017-04-02
<class 'datetime.time'>
formatted start time 09:49:57
formatted end time 17:29:08

For a date or time we can also extract their individual components. They are held internally in the datetime data structure.

# date parts.
print('formatted end date', date_end)
print(' end date year', date_end.year)
print(' end date month', date_end.month)
print(' end date day', date_end.day)
print(type(date_end.day))

# time parts.
print('formatted end time', time_end)
print(' end time hour', time_end.hour)
print(' end time minutes', time_end.minute)
print(' end time seconds', time_end.second)
print(type(time_end.second))

formatted end date 2017-04-02
 end date year 2017
 end date month 4
 end date day 2
<class 'int'>
formatted end time 17:29:08
 end time hour 17
 end time minutes 29
 end time seconds 8
<class 'int'>

We can also do arithmetic with the dates.

date_diff = datetime_end - datetime_start
date_diff
print(type(datetime_start))
print(type(date_diff))
print(date_diff)

date_diff = datetime_start - datetime_end
print(type(date_diff))
print(date_diff)

<class 'datetime.datetime'>
<class 'datetime.timedelta'>
10 days, 7:39:11
<class 'datetime.timedelta'>
-11 days, 16:20:49
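A timedelta itself has components you can extract, which is handy when "10 days, 7:39:11" needs to become a single number. The sketch below repeats the subtraction with the same dates as in the example output:

```python
from datetime import datetime

# The same subtraction as above, using the dates from the first record.
start = datetime(2017, 3, 23, 9, 49, 57)
end = datetime(2017, 4, 2, 17, 29, 8)
diff = end - start
print(diff.days)             # 10   (whole days)
print(diff.seconds)          # 27551 (the 7:39:11 remainder, in seconds)
print(diff.total_seconds())  # 891551.0 (the entire difference as one number)
```

total_seconds() is usually what you want for sorting or averaging durations, since it collapses the days/seconds pair into a single value.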
The code below calculates the time difference between (supposedly) starting the survey and ending the survey, for each respondent.

from datetime import datetime

format = "%Y-%m-%dT%H:%M:%S.%fZ"

f = open('SAFI_results.csv', 'r')

# skip the header line
line = f.readline()

for line in f:
    #print(line)
    strdate_start = line.split(',')[3]
    strdate_end = line.split(',')[4]

    datetime_start = datetime.strptime(strdate_start, format)
    datetime_end = datetime.strptime(strdate_end, format)
    date_diff = datetime_end - datetime_start
    print(datetime_start, datetime_end, date_diff)

f.close()

2017-03-23 09:49:57 2017-04-02 17:29:08 10 days, 7:39:11
2017-04-02 09:48:16 2017-04-02 17:26:19 7:38:03
2017-04-02 14:35:26 2017-04-02 17:26:53 2:51:27
2017-04-02 14:55:18 2017-04-02 17:27:16 2:31:58
2017-04-02 15:10:35 2017-04-02 17:27:35 2:17:00
2017-04-02 15:27:25 2017-04-02 17:28:02 2:00:37
2017-04-02 15:38:01 2017-04-02 17:28:19 1:50:18
2017-04-02 15:59:52 2017-04-02 17:28:39 1:28:47
2017-04-02 16:23:36 2017-04-02 16:42:08 0:18:32
...
In the Processing data from file episode we introduced the dictionary object. We created dictionaries and we added key : value pairs to the dictionary. In all of the examples that we used, the value was always a simple data type like an integer or a string. The value associated with a key in a dictionary can in fact be of any type, including a list or even another dictionary.

We created a simple dictionary object with the following code:

personDict = {'Name': 'Peter'}
personDict['Location'] = 'Manchester'

print(personDict)

{'Name': 'Peter', 'Location': 'Manchester'}

So far the keys in the dictionary each relate to a single piece of information about the person. What if we wanted to add a list of items?

personDict['Children'] = ['John', 'Jane', 'Jack']
personDict['Children_count'] = 3

print(personDict)

{'Name': 'Peter', 'Children': ['John', 'Jane', 'Jack'], 'Children_count': 3, 'Location': 'Manchester'}

Not only can I have a key where the value is a list, the value could also be another dictionary object. Suppose I want to add some telephone numbers:

personDict['phones'] = {'home': '0102345678', 'mobile': '07770123456'}
print(personDict.values())

# adding another phone
personDict['phones']['business'] = '0161234234546'
print(personDict)

dict_values(['Peter', ['John', 'Jane', 'Jack'], {'home': '0102345678', 'mobile': '07770123456'}, 3, 'Manchester'])
{'Name': 'Peter', 'Children': ['John', 'Jane', 'Jack'], 'phones': {'home': '0102345678', 'mobile': '07770123456', 'business': '0161234234546'}, 'Children_count': 3, 'Location': 'Manchester'}
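One practical point when reading nested dictionaries: looking up personDict['phones']['business'] raises a KeyError if either level is missing. Chaining the dictionary get() method gives a sketch of safer access (the 'address' key below is deliberately absent, to show the behaviour):

```python
# Chained .get() calls return None (or a supplied default) instead of
# raising KeyError when a level of the nested structure is missing.
personDict = {'Name': 'Peter',
              'phones': {'home': '0102345678', 'mobile': '07770123456'}}
print(personDict.get('phones', {}).get('mobile'))    # 07770123456
print(personDict.get('phones', {}).get('business'))  # None: key missing at the inner level
print(personDict.get('address', {}).get('city'))     # None: key missing at the outer level
```

The empty-dict default for the outer get() is what keeps the second lookup from failing. This pattern becomes very useful when we meet JSON survey data, where whole branches of the structure may be absent for some respondents.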
The ability to create dictionaries containing lists and other dictionaries makes the dictionary object very versatile; you can create an arbitrarily complex data structure of dictionaries within dictionaries. In practice you will not be doing this manually; like most data, you will read it in from a file.

The JSON data format

The JSON data format was designed as a way of allowing different machines, or processes within machines, to communicate with each other by sending messages constructed in a well-defined format. JSON is now the preferred data format used by APIs (Application Programming Interfaces). The JSON format, although somewhat verbose, is not only human-readable but can also be mapped very easily to a Python dictionary object.

We are going to read a file of data formatted as JSON, convert it into a dictionary object in Python, then selectively extract key-value pairs and create a csv file from the extracted data.

The JSON file we are going to use is the SAFI.json file. This is the output file from an electronic survey system called ODK. The JSON represents the answers to a series of survey questions. The questions themselves have been replaced with unique keys; the values are the answers. Detailed surveys are by nature nested structures, making it possible to record different levels of detail or to selectively ask a set of specific questions based on the answer given to a previous question. As a result, the structure of the answers for the survey can be complex and convoluted, and it could easily differ from one survey respondent's set of answers to another's.

Advantages of JSON
Disadvantages of JSON
Viewing JSON data
Use the JSON package to read a JSON file

import json

with open('SAFI.json') as json_data:
    d = json.load(json_data)

print(type(d))
print(type(d[0]))
print(json.dumps(d[0], indent=2))

<class 'list'>
<class 'dict'>
{
  "G02_months_lack_food": [
    "Jan"
  ],
  "G01_no_meals": 2,
  "E_no_group_count": "2",
  "A03_quest_no": "01",
...

Points to note:
By default the order in which the keys of the dictionary are printed is not guaranteed. If we want them in sorted order, we can have them sorted by using the sort_keys parameter.

print(json.dumps(d[0], indent=2, sort_keys=True))

{
  "A01_interview_date": "2016-11-17",
  "A03_quest_no": "01",
  "A04_start": "2017-03-23T09:49:57.000Z",
  "A05_end": "2017-04-02T17:29:08.000Z",
  "A06_province": "province1",
...
}

If we want to extract fields from a JSON document, the first step is to convert the JSON document into a Python dictionary. We have in fact already done this with the line d = json.load(json_data). d is a list object, and each entry in the list is a dictionary object.

Extract the fields we want into a flat format

Despite the arbitrary complexity of a JSON document or a Python dictionary object, we can adopt a very systematic approach to extracting individual fields from the structure.

The story so far: our JSON file has been read into a variable d. We know that d is a list of dictionaries. Each dictionary represents a JSON document (a record). We can print the contents of the first dictionary in the list with

print(json.dumps(d[0], indent=2, sort_keys=True))
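The mapping between JSON text and Python objects also works on plain strings: json.loads parses a JSON string the same way json.load parses a file. The tiny document below is made up for illustration, but it mimics the SAFI nesting (a farm with plots, each with crops):

```python
import json

# json.loads maps JSON text directly onto Python dicts and lists;
# json.load does the same from an open file object.
text = '{"A03_quest_no": "01", "D_plots": [{"D_crops": [{"D_curr_crop": "maize"}]}]}'
doc = json.loads(text)
print(type(doc))                                       # <class 'dict'>
print(doc["D_plots"][0]["D_crops"][0]["D_curr_crop"])  # maize
```

Each level of nesting in the JSON becomes one more [] subscript in Python, which is exactly the drilling-down we do next.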
Being able to drill down in this way is very useful in helping you get a feel for the JSON data structure. In practice it is more likely that, instead of returning the first occurrence of D_curr_crop, you will want to return all of them. This requires a little more programming and an awareness of two potential problems.
In our first attempt we will ignore these problems.

for farms in d:
    plot = farms['D_plots']
    for crops in plot:
        crop = crops['D_crops']
        for curr_crops in crop:
            print(curr_crops['D_curr_crop'])

maize
maize
maize
tomatoes
vegetable
maize
maize
maize
sorghum
...

In this version we test that all of the keys exist. This could be extended to check that the lists are not empty.

for farms in d:
    if 'D_plots' in farms:
        plot = farms['D_plots']
        for crops in plot:
            if 'D_crops' in crops:
                crop = crops['D_crops']
                for curr_crops in crop:
                    if 'D_curr_crop' in curr_crops:
                        print(curr_crops['D_curr_crop'])

We can now produce a list of all of the crops in all of the plots in all of the farms. We can also create a unique set of all of the crops grown, using the Python set data structure as shown in the code below. A set is like a list but does not allow duplicate values (and doesn't raise an error if you try to add a duplicate).

unique_crops = set()
for farms in d:
    if 'D_plots' in farms:
        plot = farms['D_plots']
        for crops in plot:
            if 'D_crops' in crops:
                crop = crops['D_crops']
                for curr_crops in crop:
                    if 'D_curr_crop' in curr_crops:
                        #print(curr_crops['D_curr_crop'])
                        unique_crops.add(curr_crops['D_curr_crop'])
print(unique_crops)

{'peanut', 'potatoes', 'tomatoes', 'other', 'vegetable', 'amendoim', 'sunflower', 'bananas', 'sesame', None, 'cucumber', 'onion', 'sorghum', 'piri_piri', 'baby_corn', 'cabbage', 'ngogwe', 'maize', 'pigeonpeas', 'beans'}

Simply having a list of all of the crops is unlikely to be enough. What you are really interested in is which farm grows which crops in which plot. We can accumulate this information as we move through the list of dictionary objects. At the top level, the farm, there is a unique identifier, A03_quest_no, which we can use. For the plot and the crop within the plot we will create our own simple indexing system (plot_no and crop_no). At the end, instead of just printing the crop name, we also print the details of where this crop is being grown.
for farms in d:
    plot_no = 0
    id = farms['A03_quest_no']
    if 'D_plots' in farms:
        plot = farms['D_plots']
        for crops in plot:
            crop_no = 0
            plot_no += 1
            if 'D_crops' in crops:
                crop = crops['D_crops']
                for curr_crops in crop:
                    crop_no += 1
                    if 'D_curr_crop' in curr_crops:
                        print("Farm no", id, "grows", curr_crops['D_curr_crop'], "in plot", plot_no, "and it is crop number", crop_no)

Farm no 01 grows maize in plot 1 and it is crop number 1
Farm no 01 grows maize in plot 2 and it is crop number 1
Farm no 01 grows maize in plot 1 and it is crop number 1
Farm no 01 grows tomatoes in plot 2 and it is crop number 1
Farm no 01 grows vegetable in plot 3 and it is crop number 1
...

The final stage of this data extraction process is to save the extracted data to a file for subsequent use. Rather than manually appending all of the information items into a string with "," separating each, we can use the csv module. To do this we need to create a csv.writer object and use it to write complete rows of data at a time. csv.writer expects the data to be provided as a list of items. For the header row we provide a list of strings containing the column names we want, and for each data row we provide the data items in a list as well.

import csv

filename = "SAFI_crops.csv"
fw = open(filename, 'w')
cf = csv.writer(fw, lineterminator='\n')

# write the header
cf.writerow(["Farm", "plot_no", "plot_area", "crop_no", "crop_name"])

for farms in d:
    plot_no = 0
    id = farms['A03_quest_no']
    if 'D_plots' in farms:
        plot = farms['D_plots']
        for crops in plot:
            crop_no = 0
            plot_no += 1
            if 'D_crops' in crops:
                plot_area = crops['D02_total_plot']
                crop = crops['D_crops']
                for curr_crops in crop:
                    crop_no += 1
                    if 'D_curr_crop' in curr_crops:
                        #print(id, plot_no, plot_area, crop_no, curr_crops['D_curr_crop'])
                        cf.writerow([id, plot_no, plot_area, crop_no, curr_crops['D_curr_crop']])

fw.close()
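csv.writer relies on you getting the list items in the right column order. A related tool, csv.DictWriter, names each column instead, which makes ordering mistakes harder. A small sketch writing one made-up row to an in-memory buffer (the columns and values are illustrative, not the full SAFI_crops.csv layout):

```python
import csv
import io

# csv.DictWriter pairs each value with a named column; fieldnames fixes
# the column order once, and writeheader() emits the header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Farm", "plot_no", "crop_name"],
                        lineterminator="\n")
writer.writeheader()
writer.writerow({"Farm": "01", "plot_no": 1, "crop_name": "maize"})
print(buf.getvalue())
```

Writing to a real file works the same way: replace the StringIO buffer with an open file object.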
pandas is a Python library containing a set of functions and specialised data structures that have been designed to help Python programmers to perform data analysis tasks in a structured way. Most of the things that pandas can do can be done with basic Python, but the collected set of pandas functions and data structures makes data analysis tasks more consistent in terms of syntax and therefore aids readability. Particular features of pandas that we will be looking at over this and the next couple of episodes include:
If you are wondering why we write pandas with a lower case 'p', it is because it is the name of the package and Python is case sensitive.

Importing the pandas library

Importing the pandas library is done in exactly the same way as for any other library. In almost all examples of Python code using the pandas library, it will have been imported and given an alias of pd. We will follow the same convention.

Pandas data structures

There are two main data structures used by pandas: the Series and the Dataframe. The Series equates in general to a vector or a list. The Dataframe is equivalent to a table. Each column in a pandas Dataframe is a pandas Series data structure. We will mainly be looking at the Dataframe. We can easily create a Pandas Dataframe by reading a .csv file.

Reading a csv file

When we read a csv dataset in base Python we did so by opening the dataset, reading and processing a record at a time and then closing the dataset after we had read the last record. Reading datasets in this way is slow and places all of the responsibility for extracting individual items of data from the records on the programmer. The main advantage of this approach, however, is that you only have to store one dataset record in memory at a time. This means that if you have the time, you can process datasets of any size.

In Pandas, csv files are read as complete datasets. You do not have to explicitly open and close the dataset. All of the dataset records are assembled into a Dataframe. If your dataset has column headers in the first record then these can be used as the Dataframe column names. You can explicitly state this in the parameters to the call, but pandas is usually able to infer that there is a header row and use it automatically.

For our examples in this episode we are going to use the SN7577.tab file. This is available for download here and the description of the file is available here.

We are going to read in our SN7577.tab file.
Although this is a tab delimited file, we will still use the pandas read_csv method, but we will explicitly tell the method that the separator is the tab character and not a comma, which is the default.

df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')
You can find out the type of the variable df_SN7577 by using the type function.

print(type(df_SN7577))

<class 'pandas.core.frame.DataFrame'>

You can see the contents by simply entering the variable name. You can see from the output that it is a tabular format. The column names have been taken from the first record of the file. On the left hand side is a column with no name. The entries here have been provided by pandas and act as an index to reference the individual rows of the Dataframe.

The read_csv() function has an index_col parameter which you can use to indicate which of the columns in the file you wish to use as the index instead. As the SN7577 dataset doesn't have a column which would uniquely identify each row, we cannot do that.

Another thing to notice about the display is that it is truncated. By default you will see the first and last 30 rows. For the columns you will always get the first few columns and typically the last few, depending on display space.

Similar information can be obtained with df_SN7577.head(), but here you are only returned the first 5 rows by default.
You can obtain other basic information about your Dataframe with:

# How many rows?
print(len(df_SN7577))
# How many rows and columns - returned as a tuple
print(df_SN7577.shape)
# How many 'cells' in the table?
print(df_SN7577.size)
# What are the column names?
print(df_SN7577.columns)
# What are the data types of the columns?
print(df_SN7577.dtypes)

1286
(1286, 202)
259772
Index(['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', ... 'numhhd', 'numkid', 'numkid2', 'numkid31', 'numkid32', 'numkid33', 'numkid34', 'numkid35', 'numkid36', 'wts'], dtype='object', length=202)
Q1 int64
Q2 int64
Q3 int64
...
Length: 202, dtype: object
We will continue this episode from where we left off in the last episode. If you have restarted Jupyter or you want to use a new notebook, make sure that you import pandas and have read the SN7577.tab dataset into a Dataframe.

import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')

Selecting rows and columns from a pandas Dataframe

If we know which columns we want before we read the data from the file, we can tell read_csv() to only import those columns by passing a list of column index numbers (starting at 0) to the usecols parameter. Alternatively we can provide a list of column names.

df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols=[0, 1, 2, 173, 174, 175])
print(df_SN7577_some_cols.shape)
print(df_SN7577_some_cols.columns)
df_SN7577_some_cols = pd.read_csv("SN7577.tab", sep='\t', usecols=['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'])
print(df_SN7577_some_cols.columns)

(1286, 6)
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')
Index(['Q1', 'Q2', 'Q3', 'sex', 'age', 'agegroups'], dtype='object')

Let us assume for now that we read in the complete file, which is now in the Dataframe df_SN7577. How can we now refer to specific columns? There are two ways of doing this using the column names (or labels):

# Both of these statements are the same
print(df_SN7577['Q1'])
# and
print(df_SN7577.Q1)

If we are interested in more than one column, the second method above cannot be used. However in the first, although we used a string with the value of 'Q1', we could also have provided a list of strings. Remember that lists are enclosed in [].

print(df_SN7577[['Q1', 'Q2', 'Q3']])

Q1 Q2 Q3
0 1 -1 1
1 3 -1 1
2 10 3 2
3 9 -1 10
...
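The equivalence of the two single-column styles, and the list-of-names style, can be checked on a tiny made-up frame (invented values, standing in for df_SN7577):

```python
import pandas as pd

# hypothetical three-column frame standing in for df_SN7577
df = pd.DataFrame({"Q1": [1, 3], "Q2": [-1, -1], "Q3": [1, 1]})

print(df["Q1"].equals(df.Q1))  # True - the same column either way
print(df[["Q1", "Q2"]].shape)  # (2, 2) - a list of names returns a Dataframe
```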
Filtering by Rows

You can filter the Dataframe by rows by specifying a range in the form a:b, where a is the index of the first row and b is one beyond the last row required.

# select rows with index of 1, 2 and 3 (rows 2, 3 and 4 in the Dataframe)
df_SN7577_some_rows = df_SN7577[1:4]
df_SN7577_some_rows
Using criteria to filter rows

It is more likely that you will want to select rows from the Dataframe based on some criteria, such as “all rows where the value for Q2 is -1”.

df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1)]
df_SN7577_some_rows

The criteria can be more complex and aren’t limited to a single column’s values:

df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)]
df_SN7577_some_rows

We can combine the row selection with column selection:

df_SN7577_some_rows = df_SN7577[(df_SN7577.Q2 == -1) & (df_SN7577.numage > 60)][['Q1', 'Q2', 'numage']]
df_SN7577_some_rows

Selecting rows on the row index is of limited use unless you need to select a contiguous range of rows. There is however another way of selecting rows using the row index:

df_SN7577_some_rows = df_SN7577.iloc[1:4]
df_SN7577_some_rows

Using the iloc method gives the same results as our previous example. However, now we can specify a single value and, more importantly, we can use the range() function to indicate the records that we want. This can be useful for making systematic selections of rows from across the Dataframe.

# Select the first row from the Dataframe
df_SN7577_some_rows = df_SN7577.iloc[0]
df_SN7577_some_rows
# select every 100th record from the Dataframe
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100)]
df_SN7577_some_rows

You can also specify column ranges using the iloc method, again using the column index numbers:

# columns 0, 1, 2 and 3
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100), 0:4]
df_SN7577_some_rows
# columns 0, 1, 2, 78 and 95
df_SN7577_some_rows = df_SN7577.iloc[range(0, len(df_SN7577), 100), [0, 1, 2, 78, 95]]
df_SN7577_some_rows

There is also a loc method which allows you to use the column names.
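The boolean-criteria pattern can be seen in miniature with invented values for two of the columns used above:

```python
import pandas as pd

# made-up values standing in for the Q2 and numage columns
df = pd.DataFrame({"Q2": [-1, 3, -1], "numage": [70, 65, 40]})

# each condition is wrapped in parentheses and combined with &
rows = df[(df.Q2 == -1) & (df.numage > 60)]
print(len(rows))  # 1 - only the first row satisfies both conditions
```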
# columns 0, 1, 2, 78 and 95 using the column names and changing 'iloc' to 'loc'
df_SN7577_some_rows = df_SN7577.loc[range(0, len(df_SN7577), 100), ['Q1', 'Q2', 'Q3', 'Q18bii', 'access6']]
df_SN7577_some_rows

Sampling

Pandas also has a sample method which allows you to extract a sample of the records from the Dataframe.

df_SN7577.sample(10, replace=False)  # ten records, do not select same record twice (this is the default)
df_SN7577.sample(frac=0.05, random_state=1)  # 5% of records, same records if run again
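The two sampling styles behave as described; a quick sketch on a throwaway 100-row frame:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})

ten = df.sample(10, replace=False)           # ten distinct rows
frac = df.sample(frac=0.05, random_state=1)  # 5% of rows, reproducible
print(len(ten), len(frac))  # 10 5
```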
For variables which contain numerical values we are often interested in various statistical measures relating to those values. For categorical variables we are often interested in how many of each unique value are present in the dataset.

We shall use the SAFI_results.csv dataset to demonstrate how we can obtain these pieces of information.

import pandas as pd
df_SAFI = pd.read_csv("SAFI_results.csv")
df_SAFI

For numeric variables we can obtain a variety of basic statistical information by using the describe() method. This can be done for the Dataframe as a whole, in which case some of the results might have no sensible meaning. If there are any missing values, represented in the display as NaN, you will get a warning message. You can also use describe() on a single variable basis.

df_SAFI['B_no_membrs'].describe()

There is also a set of methods which allow us to obtain individual values.

print(df_SAFI['B_no_membrs'].min())
print(df_SAFI['B_no_membrs'].max())
print(df_SAFI['B_no_membrs'].mean())
print(df_SAFI['B_no_membrs'].std())
print(df_SAFI['B_no_membrs'].count())
print(df_SAFI['B_no_membrs'].sum())

2
19
7.190839694656488
3.1722704895263734
131
942

Unlike the describe() method, which converts the variable to a float (when it was originally an integer), the individual summary methods only do so for the returned result if needed. We can do the same thing for the E19_period_use variable.

print(df_SAFI['E19_period_use'].min())
print(df_SAFI['E19_period_use'].max())
print(df_SAFI['E19_period_use'].mean())
print(df_SAFI['E19_period_use'].std())
print(df_SAFI['E19_period_use'].count())
print(df_SAFI['E19_period_use'].sum())

1.0
45.0
12.043478260869565
8.583030848015385
92
1108.0
Dealing with missing values

We can find out how many variables in our Dataframe contain any NaN values with the code

df_SAFI.isnull().sum()

Column1 0
A01_interview_date 0
A03_quest_no 0
A04_start 0
...

or for a specific variable

df_SAFI['E19_period_use'].isnull().sum()

Data from most sources has the potential to include missing data. Whether or not this presents a problem at all depends on what you are planning to do.

We have been using data from two very different sources. The SN7577 dataset is provided by the UK Data Service. Datasets from the UK Data Service have already been ‘cleaned’, and it is unlikely that there will be any genuinely missing data. However you may find that data which was missing has been replaced with a value such as ‘-1’ or ‘Not Specified’. In cases like these it may be appropriate to replace these values with ‘NaN’ before you try to process the data further.

The SAFI dataset we have been using comes from a project called ‘Studying African Farmer-led Irrigation’. The data for this project is questionnaire based, but rather than using a paper-based questionnaire, it has been created and is completed electronically via an app on a smartphone. This provides flexibility in the design and presentation of the questionnaire; a section of the questionnaire may only be presented depending on the answer given to some preceding question. This means that there can quite legitimately be a set of ‘NaN’ values in a record (one complete questionnaire) where you would still consider the record to be complete.

We have already seen how we can check for missing values. There are three other actions we need to be able to do:

1. Remove the rows containing NaN values.
2. Replace NaN with a value of our choice.
3. Replace specific values with NaN.
With these options we can ensure that the data is suitable for the further processing we have planned.

The dropna() method will delete all rows if any of the variables contain a NaN. For some datasets this may be acceptable. You will need to take care that you have enough rows left for your analysis to have meaning.

df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI.dropna(inplace=True)
print(df_SAFI.shape)

Because there are variables in the SAFI dataset which are all NaN, using the dropna() method effectively deletes all of the rows from the Dataframe, probably not what you wanted. Instead we can use the notnull() method as a row selection criterion and delete the rows where a specific variable has NaN values.

df_SAFI = pd.read_csv("SAFI_results.csv")
print(df_SAFI.shape)
df_SAFI = df_SAFI[(df_SAFI['E_no_group_count'].notnull())]
print(df_SAFI.shape)

Replace NaN with a value of our choice

The E19_period_use variable answers the question: “For how many years have you been irrigating the land?”. In some cases the land is not irrigated, and these cases are represented by NaN in the dataset. So when we run

df_SAFI['E19_period_use'].describe()

we get a count value of 92 and all of the other statistics are based on this count value.

Now suppose that instead of NaN the interviewer had entered a value of 0, to indicate that the land which is not irrigated has been irrigated for 0 years, which is technically correct. To see what happens we can convert all of the NaN values in the E19_period_use column to 0 with the following code:

df_SAFI['E19_period_use'].fillna(0, inplace=True)

If we now run describe() again you can see that all of the statistics have changed, because the calculations are now based on a count of 131. Probably not what we would have wanted. Conveniently, this allows us to demonstrate our third action.
Replace specific values with NaN

Although we can recognise NaN with methods like isnull() or dropna(), actually creating a NaN value and putting it into a Dataframe requires the numpy module. The following code will replace our 0 values with NaN. We can demonstrate that this has occurred by running describe() again and seeing that we now have our original values back.

import numpy as np
df_SAFI['E19_period_use'].replace(0, np.NaN, inplace=True)
df_SAFI['E19_period_use'].describe()

Categorical variables

For categorical variables, numerical statistics don’t make any sense. For a categorical variable we can obtain a list of unique values used by the variable by using the unique() method.

df_SAFI = pd.read_csv("SAFI_results.csv")
pd.unique(df_SAFI['C01_respondent_roof_type'])

array(['grass', 'mabatisloping', 'mabatipitched'], dtype=object)

Knowing all of the unique values is useful, but what is more useful is knowing how many occurrences of each there are. In order to do this we can use the groupby method. Having performed the groupby() we can then describe() the results. The format is similar to that which we have seen before, except that the ‘grouped by’ variable appears to the left and there is a set of statistics for each unique value of the variable.

grouped_data = df_SAFI.groupby('C01_respondent_roof_type')
grouped_data.describe()

You can group by more than one variable at a time by providing them as a list.

grouped_data = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])
grouped_data.describe()

You can also obtain individual statistics if you want.

A11_years_farm = df_SAFI.groupby(['C01_respondent_roof_type', 'C02_respondent_wall_type'])['A11_years_farm'].count()
A11_years_farm

C01_respondent_roof_type  C02_respondent_wall_type
grass                     burntbricks    22
                          muddaub        42
                          sunbricks       9
mabatipitched             burntbricks     6
                          muddaub         3
...
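The whole NaN round trip (count the missing values, fill them, then restore them) can be sketched on a tiny made-up Series standing in for E19_period_use:

```python
import numpy as np
import pandas as pd

# invented values standing in for the E19_period_use column
years = pd.Series([1.0, np.nan, 45.0, np.nan])

print(years.isnull().sum())           # 2 - two missing values
filled = years.fillna(0)              # NaN -> 0 raises count() from 2 to 4
print(filled.count())                 # 4
restored = filled.replace(0, np.nan)  # 0 -> NaN restores the original
print(restored.count())               # 2
```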
There are many occasions when we have related data spread across multiple files. The data can be related to each other in different ways. How they are related and how completely we can join the data from the datasets will vary. In this episode we will consider different scenarios and show how we might join the data. We will use csv files, and in all cases the first step will be to read the datasets into a pandas Dataframe from where we will do the joining. The csv files we are using are cut down versions of the SN7577 dataset, to make the displays more manageable.

First, let’s download the datafiles. They are listed in the setup page for the lesson. Alternatively, you can download the GitHub repository for this lesson. The data files are in the data directory. If you’re using Jupyter, make sure to place these files in the same directory where your notebook file is.

Scenario 1 - Two data sets containing the same columns but different rows of data

Here we want to add the rows from one Dataframe to the rows of the other Dataframe. In order to do this we can use the pd.concat() function.

import pandas as pd
df_SN7577i_a = pd.read_csv("SN7577i_a.csv")
df_SN7577i_b = pd.read_csv("SN7577i_b.csv")

Have a quick look at what these Dataframes look like with

print(df_SN7577i_a)
print(df_SN7577i_b)

Id Q1 Q2 Q3 Q4
0 1 1 -1 1 8
1 2 3 -1 1 4
2 3 10 3 2 6
3 4 9 -1 10 10
...
Id Q1 Q2 Q3 Q4
0 1277 10 10 4 6
1 1278 2 -1 5 4
2 1279 2 -1 4 5
3 1280 1 -1 2 3
...

The concat() function appends the rows from the two Dataframes to create the df_all_rows Dataframe. When you list this out you can see that all of the data rows are there; however, there is a problem with the index.

df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b])
df_all_rows

We didn’t explicitly set an index for any of the Dataframes we have used. For df_SN7577i_a and df_SN7577i_b default indexes would have been created by pandas.
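A tiny made-up pair of frames shows the duplicated-index problem the text describes:

```python
import pandas as pd

# invented two-row frames standing in for the SN7577i files
a = pd.DataFrame({"Id": [1, 2]})
b = pd.DataFrame({"Id": [3, 4]})

both = pd.concat([a, b])
print(both.index.tolist())  # [0, 1, 0, 1] - the two default indexes, concatenated
```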
When we concatenated the Dataframes the indexes were also concatenated, resulting in duplicate entries. This is really only a problem if you need to access a row by its index. We can fix the problem with the following code.

df_all_rows = df_all_rows.reset_index(drop=True)
# or, alternatively, there's the `ignore_index` option in the `pd.concat()` function:
df_all_rows = pd.concat([df_SN7577i_a, df_SN7577i_b], ignore_index=True)
df_all_rows

What if the columns in the Dataframes are not the same?

df_SN7577i_aa = pd.read_csv("SN7577i_aa.csv")
df_SN7577i_bb = pd.read_csv("SN7577i_bb.csv")
df_all_rows = pd.concat([df_SN7577i_aa, df_SN7577i_bb])
df_all_rows

In this case df_SN7577i_aa has no Q4 column and df_SN7577i_bb has no Q3 column. When they are concatenated, the resulting Dataframe has a column for Q3 and Q4. For the rows corresponding to df_SN7577i_aa the values in the Q4 column are missing and denoted by NaN. The same applies to Q3 for the df_SN7577i_bb rows.

Scenario 2 - Adding the columns from one Dataframe to those of another Dataframe

df_SN7577i_c = pd.read_csv("SN7577i_c.csv")
df_SN7577i_d = pd.read_csv("SN7577i_d.csv")
df_all_cols = pd.concat([df_SN7577i_c, df_SN7577i_d], axis=1)
df_all_cols

We use the axis=1 parameter to indicate that it is the columns that need to be joined together. Notice that the Id column appears twice, because it was a column in each dataset. This is not particularly desirable, but also not necessarily a problem. However, there are better ways of combining columns from two Dataframes which avoid this problem.

Scenario 3 - Using merge to join columns

We can join columns from two Dataframes using the merge() function. This is similar to the SQL ‘join’ functionality. A detailed discussion of different join types is given in the SQL lesson. You specify the type of join you want using the how parameter. The default is the inner join, which returns the columns from both tables where the key or common column values match in both Dataframes.
The possible values of the how parameter are shown in the picture below (taken from the Pandas documentation). The different join types behave in the same way as they do in SQL. In Python/pandas, any missing values are shown as NaN.

In order to merge the Dataframes we need to identify a column common to both of them.

df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner')
df_cd

In fact, if there is only one column with the same name in each Dataframe, it will be assumed to be the one you want to join on; in this example the Id column. Leaving the join column to default in this way is not best practice. It is better to explicitly name the column using the on parameter.

df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', on='Id')

In many circumstances, the column names that you wish to join on are not the same in both Dataframes. In that case you can use the left_on and right_on parameters to specify them separately.

df_cd = pd.merge(df_SN7577i_c, df_SN7577i_d, how='inner', left_on='Id', right_on='Id')
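The difference between the inner join and the other join types can be seen on two small invented tables sharing an Id column:

```python
import pandas as pd

# two made-up tables with a common 'Id' column
left = pd.DataFrame({"Id": [1, 2, 3], "Q1": [10, 20, 30]})
right = pd.DataFrame({"Id": [2, 3, 4], "Q5": [5, 6, 7]})

inner = pd.merge(left, right, how="inner", on="Id")  # only Ids present in both
outer = pd.merge(left, right, how="outer", on="Id")  # all Ids, NaN where absent
print(len(inner), len(outer))  # 2 4
```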
In the SN7577 dataset that we have been using there is a group of columns which record which daily newspapers each respondent reads. Despite the un-informative names like ‘daily1’, each column refers to a current UK daily national or local newspaper. Whether the paper is read or not is recorded using the values of 0 or 1 as a boolean indicator. The advantage of using a column for each paper is that should a respondent read multiple newspapers, all of the required information can still be recorded in a single record.

Recording information in this wide format is not always beneficial when trying to analyse the data. Pandas provides methods for converting data from wide to long format and from long to wide format.

The SN7577 dataset does not contain a variable that can be used to uniquely identify a row. This is often referred to as a ‘primary key’ field (or column). A dataset doesn’t need to have such a key. None of the work we have done so far has required it. When we create a pandas Dataframe by importing a csv file, we have seen that pandas will create an index for the rows. This index can be used a bit like a key field, but as we have seen there can be problems with the index when we concatenate two Dataframes together.

In the version of SN7577 that we are going to use to demonstrate long and wide formats, we will add a new variable with the name of ‘Id’ and we will restrict the other columns to those starting with the word ‘daily’.

import pandas as pd
df_SN7577 = pd.read_csv("SN7577.tab", sep='\t')

We will create a new Dataframe with a single column of ‘Id’.

# create an 'Id' column
df_papers1 = pd.DataFrame(pd.Series(range(1, 1287)), index=None, columns=['Id'])

Using the range function, we can create values of Id starting with 1 and going up to 1286 (remember the second parameter to range is one past the last value used). We have explicitly coded this value because we knew how many rows were in the dataset.
If we didn’t, we could have used len(df_SN7577) to find the number of rows.

We will create a second Dataframe, based on SN7577 but containing only the columns starting with the word ‘daily’. There are several ways of doing this; we will use the filter method of pandas with its like parameter, as it is the only approach for which we have covered all of the prerequisites.

df_papers2 = df_SN7577.filter(like='daily')

The value supplied to like can occur anywhere in the column name to be matched (and therefore selected).
To create the Dataframe that we will use, we will concatenate the two Dataframes we have created.

df_papers = pd.concat([df_papers1, df_papers2], axis=1)
print(df_papers.index)
print(df_papers.columns)

RangeIndex(start=0, stop=1286, step=1)
Index(['Q1', 'daily1', 'daily2', 'daily3', 'daily4', 'daily5', 'daily6',
       'daily7', 'daily8', 'daily9', 'daily10', 'daily11', 'daily12',
       'daily13', 'daily14', 'daily15', 'daily16', 'daily17', 'daily18',
       'daily19', 'daily20', 'daily21', 'daily22', 'daily23', 'daily24',
       'daily25'],
      dtype='object')

We use axis=1 because we are joining by columns; the default is joining by rows (axis=0).

From ‘wide’ to ‘long’

To make the displays more manageable we will use only the first seven ‘daily’ columns.

## using df_papers
daily_list = df_papers.columns[1:8]

df_daily_papers_long = pd.melt(df_papers, id_vars=['Id'], value_vars=daily_list)

# by default, the new columns created will be called 'variable', which is the name of the 'daily'
# column, and 'value', which is the value of that 'daily' for that 'Id'. So, we will rename the columns
df_daily_papers_long.columns = ['Id', 'Daily_paper', 'Value']
df_daily_papers_long

We now have a Dataframe that we can group by. We want to group by the Daily_paper and then sum the Value.

a = df_daily_papers_long.groupby('Daily_paper')['Value'].sum()
a

Daily_paper
daily1 0
daily2 26
daily3 52

From ‘long’ to ‘wide’

The process can be reversed by using the pivot() method. Here we need to indicate which column (or columns) remains fixed (this will become an index in the new Dataframe), which column contains the values which are to become column names, and which column contains the values for the columns. In our case we want to use the Id column as the fixed column, the Daily_paper column contains the column names and the Value column contains the values.
df_daily_papers_wide = df_daily_papers_long.pivot(index='Id', columns='Daily_paper', values='Value')

We can change our Id index back to an ordinary column with

df_daily_papers_wide.reset_index(level=0, inplace=True)
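The long-to-wide pivot can be sketched end-to-end on a toy long-format table with the same three columns as df_daily_papers_long (the Ids and values are invented):

```python
import pandas as pd

# toy long-format table mirroring df_daily_papers_long
long_df = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Daily_paper": ["daily1", "daily2", "daily1", "daily2"],
    "Value": [0, 1, 1, 0],
})

wide = long_df.pivot(index="Id", columns="Daily_paper", values="Value")
wide.reset_index(level=0, inplace=True)
print(wide.columns.tolist())  # ['Id', 'daily1', 'daily2']
```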
There is a wide variety of ways to plot in Python, as in many programming languages. Some do more of the design work for you and others let you customise the look of the plots and all of the little details yourself. Pandas has basic plots built into it that reduce the amount of syntax, if your data is already in a Dataframe. Matplotlib is a Python graphical library that can be used to produce a variety of different graph types. It is fully controllable down to basic elements and includes a module pylab that is somewhere in between (designed to feel like MATLAB plotting, if you happen to have done that before).

The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported with import pandas as pd, you will find that Matplotlib is most commonly imported with import matplotlib.pyplot as plt, where ‘plt’ is the alias.

In addition to importing the library, in a Jupyter notebook environment we need to tell Jupyter that when we produce a graph, we want it to display the graph in a cell in the notebook just like any other results. To do this we use the %matplotlib inline directive.

Although we are using Matplotlib in this episode, Pandas can make use of several other graphical libraries available from within Python, such as ggplot and Seaborn. Seaborn has some very powerful features and advanced plot types. One of its most useful features is formatting.

Plotting with Pandas

The Pandas library contains very tight integration with Matplotlib. There are functions in Pandas that automatically call Matplotlib functions to produce graphs. To plot with Pandas we have to import it as we have done in past episodes. To tell Jupyter that when we produce a graph we want it to be displayed in a cell in the notebook just like any other results, we use the %matplotlib inline directive. Without it we would need to call the show() function.
import pandas as pd
%matplotlib inline

We also need data to work with, loaded into a Dataframe, and it’s helpful to look at a few rows to remember what’s there. We are going to use the dataset from the setup page, SAFI_full_shortname.csv. For the data to load, make sure to have that file in the same folder where your Jupyter notebook is running. If the file is not in that folder, you are going to have to type the full path.

df = pd.read_csv("SAFI_full_shortname.csv")
df.head()

Next, we can plot a histogram of a variable. We can change the number of bins to make it look how we would like, for example

df['years_liv'].hist(bins=20)

We can also specify the column as a parameter and a groupby column with the by keyword. There are a lot of keywords available to make it look better; we can see some of the most likely ones (as decided by the Pandas developers) by using Shift + Tab. Let’s try layout, figsize, and sharex.

df.hist(column='years_liv', by='village', layout=(1, 3), figsize=(12, 3), sharex=True)

Scatter plot

The scatter plot requires the x and y coordinates of each of the points being plotted. We can add a third dimension as different colours with the c argument.

df.plot.scatter(x='gps_Latitude', y='gps_Longitude', c='gps_Altitude', colormap="viridis", figsize=[4, 4])
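Outside a notebook, the same histogram call can be exercised with toy data; this sketch uses the off-screen Agg backend in place of %matplotlib inline, and the values and filename are invented for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend, standing in for %matplotlib inline
import pandas as pd

# invented values standing in for the years_liv column of SAFI
df = pd.DataFrame({"years_liv": [4, 9, 15, 22, 9, 30, 12, 18]})

ax = df["years_liv"].hist(bins=5)       # returns the matplotlib Axes
ax.figure.savefig("years_liv_hist.png")  # write the figure to a file
```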
Boxplot

A boxplot provides a simple representation of a variety of statistical qualities of a single set of data values. A common use of the boxplot is to compare the statistical variations across a set of variables. The variables can be an independent series or columns of a Dataframe, using the Pandas plot method.

df.boxplot(by='village', column=['buildings_in_compound'])

We can make it look prettier with Seaborn, much more easily than fixing components manually with Matplotlib. Seaborn is a Python data visualisation library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. Seaborn comes with Anaconda; to make it available in our Python session we need to import it.

import seaborn as sns
sns.boxplot(data=df, x='village', y='buildings_in_compound')

We can also draw linear models in a plot using lmplot() from Seaborn, e.g. for years_farm vs years_liv per village.

sns.lmplot(x='years_farm', y='years_liv', data=df, hue='village')

In general, most graphs can be broken down into a series of elements which, although typically related in some way, can all exist independently of each other. This allows us to create the graph in a rather piecemeal fashion. The labels (if any) on the x and y axis are independent of the data values being represented. The title and the legend are also independent objects within the overall graph. In Matplotlib you create the graph by providing values for all of the individual components you choose to include. When you are ready, you call the show function. Using this same approach, we can plot two sets of data on the same graph. We will use a scatter plot to demonstrate some of the available features.

Fine-tuning figures with Matplotlib

If we want to do more advanced or lower level things with our plots, we need to use Matplotlib directly, not through Pandas.
Customising our plots with Matplotlib

We can further customise our plots with Matplotlib directly. First we need to import it. The Matplotlib library can be imported using any of the import techniques we have seen. As Pandas is generally imported with import pandas as pd, you will find that Matplotlib is most commonly imported with import matplotlib.pyplot as plt, where plt is the alias. For demonstration purposes, we are going to use randomly generated data, using the NumPy library (aliased here as np).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate some data for 2 sets of points.
x1 = pd.Series(np.random.rand(20) - 0.5)
y1 = pd.Series(np.random.rand(20) - 0.5)
x2 = pd.Series(np.random.rand(20) + 0.5)
y2 = pd.Series(np.random.rand(20) + 0.5)

# Add some features
plt.title('Scatter Plot')
plt.ylabel('Range of y values')
plt.xlabel('Range of x values')

# plot the points in a scatter plot
plt.scatter(x1, y1, c='red', label='Red Range')  # 'c' parameter is the colour and 'label' is the text for the legend
plt.scatter(x2, y2, c='blue', label='Blue Range')

plt.legend(loc=4)  # the locations 1, 2, 3 and 4 are top-right, top-left, bottom-left and bottom-right

# Show the graph with the two sets of points
plt.show()

In the call to the scatter method, the label parameter values are used by the legend. The c or color parameter can be set to any colour Matplotlib recognises. Full details of the available colours are available on the Matplotlib website. The markers section will tell you what markers you can use instead of the default ‘dots’. There is also an s (size) parameter which allows you to change the size of the marker.
Internally the Pandas ‘plot’ method has called the ‘bar’ method of Matplotlib and provided a set of parameters, including the pandas.Series s, to generate the graph. We can use Matplotlib directly to produce a similar graph; in this case we need to pass two parameters, the number of bars we need and the Pandas Series holding the values.

Let’s redo the boxplot we did above:

df.boxplot(column='buildings_in_compound', by='village')

The automatic title of the plot does not look good, we are missing a title for the y-axis and we do not need the extra x-axis title. We can also remove the gridlines. Let’s fix these things using functions from plt. Note: all the adjustments for the plot have to go into the same notebook cell together with the plot statement itself.

df.boxplot(column='buildings_in_compound', by='village')
plt.suptitle('')  # remove the automatic title
plt.title('Buildings in compounds per village')  # add a title
plt.ylabel('Number of buildings')  # add a y-axis title
plt.xlabel('')  # remove the x-axis title
plt.grid(None)  # remove the grid lines

Saving a graph

If you wish to save your graph as an image you can do so using the plt.savefig() function. The image can be saved as a pdf, jpg or png file by changing the file extension. plt.savefig() needs to be called at the end of all your plot statements, in the same notebook cell.
df.boxplot(column='buildings_in_compound', by='village')
plt.suptitle('')  # remove the automatic title
plt.title('Buildings in compounds per village')  # add a title
plt.ylabel('Number of buildings')  # add a y-axis title
plt.xlabel('')  # remove the x-axis title
plt.grid(None)  # remove the grid lines
plt.savefig('safi_boxplot_buildings.pdf')  # save as pdf file
plt.savefig('safi_boxplot_buildings.png', dpi=150)  # save as png file, some extra arguments are provided
SQLite is a relational database system. Despite the ‘Lite’ in the name it can handle databases in excess of a terabyte. The ‘Lite’ part really relates to the fact that it is a ‘bare bones’ system. It provides the mechanisms to create and query databases via a simple command line interface, but not much else. In the SQL lesson we used a Firefox plugin to provide a GUI (Graphical User Interface) to the SQLite database engine. In this lesson we will use Python code and the sqlite3 module to access the engine. We can use Python code and the sqlite3 module to create, delete and query database tables. In practice we spend a lot of the time querying database tables.

Pandas Dataframe v SQL table

It is very easy, and often very convenient, to think of SQL tables and pandas Dataframes as being similar types of objects. All of the data manipulations, slicing, dicing, aggregations and joins associated with SQL and SQL tables can be accomplished with pandas methods operating on a pandas Dataframe. The difference is that the pandas Dataframe is held in memory within the Python environment, whereas the SQL table can largely be on disc; when you access it, it is the SQLite database engine which is doing the work. This allows you to work with very large tables which your Python environment may not have the memory to hold completely. A typical use case for SQLite databases is to hold large datasets: you use SQL commands from Python to slice and dice, and possibly aggregate, the data within the database system to reduce the size to something that Python can comfortably process, and then return the results to a Dataframe.

Accessing data stored in SQLite using Python

We will illustrate the use of the sqlite3 module by connecting to an SQLite database using both core Python and also pandas. The database that we will use is SN7577.sqlite. This contains the data from the SN7577 dataset that we have used in other lessons.
Connecting to an SQLite database

The first thing we need to do is import the sqlite3 library. We will import pandas at the same time for convenience.

import sqlite3
import pandas as pd

We will start looking at the sqlite3 library by connecting to an existing database and returning the results of running a query. Initially we will do this without using pandas and then we will repeat the exercise so that you can see the difference. The first thing we need to do is to make a connection to the database. An SQLite database is just a file. To make a connection to it we only need to use the sqlite3 connect() function and specify the database file as the first parameter. The connection is assigned to a variable. You could use any variable name, but ‘con’ is quite commonly used for this purpose.

con = sqlite3.connect('SN7577.sqlite')

The next thing we need to do is to create a cursor for the connection and assign it to a variable. We do this using the cursor method of the connection object. The cursor allows us to pass SQL statements to the database, have them executed and then get the results back. To execute an SQL statement we use the execute() method of the cursor object. The only parameter we need to pass to execute() is a string which contains the SQL query we wish to execute. In our example we are passing a literal string. It could have been contained in a string variable. The string can contain any valid SQL query. It could also be a valid DDL statement such as a “CREATE TABLE …”. In this lesson however we will confine ourselves to querying existing database tables.

cur = con.cursor()
cur.execute("SELECT * FROM SN7577")

<sqlite3.Cursor at 0x115e10d50>

The execute() method doesn’t actually return any data, it just indicates that we want the data provided by running the SELECT statement.
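As noted above, the query string can equally well live in a variable. A minimal, self-contained sketch (using an in-memory database and a hypothetical one-column table, since the SN7577 file is not needed here) also shows the usual way of supplying values to a query: through ‘?’ placeholders in the second argument to execute(), rather than by building the string yourself.

```python
import sqlite3

con = sqlite3.connect(':memory:')  # throwaway in-memory database for the sketch
cur = con.cursor()
cur.execute('CREATE TABLE SN7577_demo (Q1 INTEGER)')  # hypothetical table name
cur.executemany('INSERT INTO SN7577_demo VALUES (?)', [(1,), (10,), (10,)])

# the query is held in a string variable; the value is passed separately
query = 'SELECT COUNT(*) FROM SN7577_demo WHERE Q1 = ?'
cur.execute(query, (10,))
count = cur.fetchone()[0]
print(count)  # 2 rows have Q1 = 10
con.close()
```

Using placeholders lets sqlite3 handle quoting for you, which matters as soon as the values come from user input.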
Before we can make use of the results of the query we need to use the fetchall() method of the cursor. The fetchall() method returns a list. Each item in the list is a tuple containing the values from one row of the table. You can iterate through the items in a tuple in the same way as you would do so for a list.

cur = con.cursor()
cur.execute("SELECT * FROM SN7577")
rows = cur.fetchall()
for row in rows:
    print(row)

(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3 ...

The output is the data only; you do not get the column names. The column names are available from the ‘description’ of the cursor.

colnames = []
for description in cur.description :
    colnames.append(description[0])

print(colnames)

['Q1', 'Q2', 'Q3', 'Q4', 'Q5ai', 'Q5aii', 'Q5aiii', 'Q5aiv', 'Q5av', 'Q5avi', 'Q5avii', 'Q5aviii', 'Q5aix', 'Q5ax', 'Q5axi', 'Q5axii', 'Q5axiii', 'Q5axiv', 'Q5axv', 'Q5bi', 'Q5bii', 'Q5biii', 'Q5biv', 'Q5bv', 'Q5bvi', 'Q5bvii', 'Q5bviii', 'Q5bix', 'Q5bx', 'Q5bxi', 'Q5bxii', 'Q5bxiii', 'Q5bxiv', 'Q5bxv', 'Q6', 'Q7a', 'Q7b', 'Q8', 'Q9', 'Q10a', 'Q10b', 'Q10c', 'Q10d', 'Q11a', ...

One reason for using a database is the size of the data involved. Consequently it may not be practical to use fetchall() as this will return the complete result of your query. An alternative is to use the fetchone() method, which as the name suggests returns only a single row. The cursor keeps track of where you are in the results of the query, so the next call to fetchone() will return the next record. When there are no more records it will return ‘None’.
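Because fetchone() returns ‘None’ at the end of the results, the usual pattern is a loop that reads one row at a time until None appears. A self-contained sketch, using an in-memory database and a made-up three-row table named `t`:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE t (n INTEGER)')  # hypothetical demo table
cur.executemany('INSERT INTO t VALUES (?)', [(1,), (2,), (3,)])

cur.execute('SELECT n FROM t')
rows_seen = 0
while True:
    row = cur.fetchone()
    if row is None:       # no more records in the result set
        break
    rows_seen += 1
    print(row)            # one tuple per row, e.g. (1,)
con.close()
```

Only one row is held in a Python variable at a time, so the loop works the same way no matter how large the result set is.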
cur = con.cursor()
cur.execute("SELECT * FROM SN7577")

row = cur.fetchone()
print(row)
row = cur.fetchone()
print(row)

(1, -1, 1, 8, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 3, 2, 3, 3, 4, 1, 4, 2, 2, 2, 2, 1, 0, 0, 0, 3, 2, 3, 3, 1, 4, 2, 3, 2, 4, 4, 2, 2, 2, 4, 2, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0
Using Pandas to read a database table

When you use pandas to read a database table, you connect to the database in the same way as before, using the sqlite3 connect() function and providing the filename of the database file. Pandas has a method read_sql_query to which you provide both the string containing the SQL query you wish to run and also the connection variable. The results from running the query are placed in a pandas Dataframe with the table column names automatically added.

con = sqlite3.connect('SN7577.sqlite')
df = pd.read_sql_query("SELECT * from SN7577", con)

# verify that result of SQL query is stored in the Dataframe
print(type(df))
print(df.shape)
print(df.head())

con.close()

Saving a Dataframe as an SQLite table

There may be occasions when it is convenient to save the data in your pandas Dataframe as an SQLite table for future use or for access from other systems. This can be done using the to_sql() method.

con = sqlite3.connect('SN7577.sqlite')
df = pd.read_sql_query("SELECT * from SN7577", con)

# select only the rows where the response to Q1 is 10, meaning an undecided voter
df_undecided = df[df.Q1 == 10]
print(df_undecided.shape)

# Write the new Dataframe to a new SQLite table
df_undecided.to_sql("Q1_undecided", con)

# If you want to overwrite an existing SQLite table you can use the 'if_exists' parameter
#df_undecided.to_sql("Q1_undecided", con, if_exists="replace")

con.close()

Deleting an SQLite table

If you have created tables in an SQLite database, you may also want to delete them. You can do this by using the sqlite3 cursor execute() method.

con = sqlite3.connect('SN7577.sqlite')
cur = con.cursor()
cur.execute('drop table if exists Q1_undecided')
con.close()
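To check that a table really was created or dropped, you can query SQLite's built-in sqlite_master catalogue, which lists the schema objects in the database. A self-contained sketch (in-memory database and a hypothetical table name, so the SN7577 file is untouched):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Q1_undecided_demo (Q1 INTEGER)')  # hypothetical table

def table_names(con):
    # sqlite_master is SQLite's built-in catalogue of schema objects
    cur = con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in cur.fetchall()]

before = table_names(con)
print(before)                 # the demo table is listed

cur.execute('drop table if exists Q1_undecided_demo')
after = table_names(con)
print(after)                  # and gone after the drop

con.close()
```

The same query is a convenient way to confirm what to_sql() created before you rely on the table in later scripts.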