The following article explains how to work with the datasets we provide.
FutureLearn adopted the CSV standard as its export/interchange format to allow maximum compatibility with the widest range of analysis toolsets. However, it is important to remember that importing these files into your application of choice may involve additional steps to ensure that they are interpreted correctly.
While it is impractical for FutureLearn to offer individualised support for different software packages, the sections below provide some guidance based on our experience to date.
Using the datasets in Microsoft Excel presents several challenges arising from the way it handles CSV importing/opening, and our desire to present data unaltered.
When working with any of the datasets, you will need to “import” rather than “open” the files (i.e. not simply double-click the CSV), so that Excel’s import wizard lets you set the data type of each column. It is important that you set the step number column to be interpreted as “Text” rather than “General”. This prevents Excel from treating it as a standard floating-point number and equating 1.10 with 1.1.
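The effect is easy to reproduce outside Excel. The short Python sketch below is illustrative only (the step values are made up): it shows two distinct step numbers collapsing into one value once they are parsed as floating-point numbers, while remaining distinct when kept as text.

```python
# Step numbers are labels of the form "week.step", not decimal quantities.
steps = ["1.1", "1.10", "2.3"]

# Parsed as floating-point numbers, "1.1" and "1.10" become the same value.
as_numbers = [float(s) for s in steps]
print(as_numbers[0] == as_numbers[1])  # True

# Kept as text, the two steps remain distinct.
print(steps[0] == steps[1])  # False
```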
When working with the comment or peer review datasets in Excel, you will need to undertake some additional data cleaning. The text column in these files (representing learner-generated content) can span multiple lines, with line breaks and carriage returns. When a CSV is opened, Excel correctly handles multi-line comments but incorrectly interprets step numbers as floating-point numbers. When a CSV is imported, the step numbers can be interpreted correctly, but no combination of settings will result in the text column being handled correctly.
There are two approaches that can be used here, depending on your intended analysis. You can:
- Wrap the step number in single quotes to force Excel to see it as a string. This can be achieved using ‘sed’ which is installed on Mac OSX/Unix machines, or can be downloaded for Windows:
sed -E "s/,([0-9]+)\.([0-9]+),/,'\1.\2',/g" course_comments.csv > course_comments_new.csv
You can then open/double click the CSV file, select the step column and set its type to “Text”, and if desired Find All/Replace All the single quote with an empty string.
- Strip carriage returns from the text column and then import the file, ensuring the step number is interpreted as text.
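As a rough illustration of the second approach, the Python 3 sketch below copies a CSV file and replaces line breaks in one column with spaces. It is a sketch under stated assumptions rather than a supported tool: the function name strip_line_breaks is hypothetical, and it assumes the file is UTF-8 encoded and has a header row containing a column named "text".

```python
import csv
import re

def strip_line_breaks(in_path, out_path, column="text"):
    """Copy a CSV file, replacing line breaks in one column with spaces.

    Assumes a header row that contains `column` and UTF-8 encoding.
    """
    # newline="" lets the csv module parse line breaks inside quoted fields.
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Replace each run of carriage returns / line feeds with one space.
            row[column] = re.sub(r"[\r\n]+", " ", row[column])
            writer.writerow(row)
```

For example, calling strip_line_breaks("course_comments.csv", "course_comments_clean.csv") would write a cleaned copy (the output filename is only a suggestion), which can then be imported into Excel with the step number column set to “Text”.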
SPSS allows you to write scripts in the Python language which can interact with your workspace. The following script will load and correctly format the comments dataset. Note that you will need to amend the filename to point to the full path of your dataset.
BEGIN PROGRAM PYTHON.
import spss, csv, re

def remove_quotes(s):
    return ''.join(c for c in s if c not in ('"', "'")).replace('\n', ' ').replace('\r', '')

spss.StartDataStep()
dsObj = spss.Dataset(name=None)
dsName = dsObj.name
dsObj.varlist.append('id', 16)
dsObj.varlist.append('author_id', 43)
dsObj.varlist.append('parent_id', 16)
dsObj.varlist.append('step', 5)
dsObj.varlist.append('text', 32767)
dsObj.varlist.append('timestamp', 23)
dsObj.varlist.append('moderated', 23)
dsObj.varlist.append('likes', 8)
with open("course_comments.csv","rb") as infile:
    reader = csv.reader(infile)
    for line in reader:
        dsObj.cases.append([remove_quotes(elem) for elem in line])
spss.EndDataStep()
spss.Submit('DATASET ACTIVATE %s' %dsName)
END PROGRAM.
Importing the datasets into R is a fairly simple task. At a basic level, you can use the built-in read.csv() function to load them into a data frame. However, the resulting data frames may represent some of the columns in ways that are less useful for ongoing analysis (for example, step numbers read as numeric values). Instead, it may be advisable to declare the data types explicitly, as shown below:
stepActivity <- read.csv("course_step-activity.csv", colClasses = c("factor", "factor", "character", "character"))
comments <- read.csv("course_comments.csv", colClasses = c("numeric", "factor", "numeric", "factor", "character", "character", "character", "numeric"))
enrolments <- read.csv("course_enrolments.csv", colClasses = c("factor", "character", "character"))
questionResponse <- read.csv("course_question-response.csv", colClasses = c("factor", "factor", "factor", "character", "logical"))
peerReviewAssignments <- read.csv("course_peer-review-assignments.csv", colClasses = c("numeric", "factor", "factor", "character", "character", "character", "character", "numeric"))
peerReviewReviews <- read.csv("course_peer-review-reviews.csv", colClasses = c("numeric", "factor", "factor", "numeric", "character", "character", "character", "character"))